This allows us to implement backend-specific workarounds and to use the
more appropriate device-specific flushing.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Fixes for resubmitting batches after running out of space for vertex
buffers, and also a couple of trivial spans functions.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The gen4+ spec is a little misleading as it states that all BLT pitches
for the XY commands are in dwords. Apparently not, as the
upload/download functions were already demonstrating. This only became
apparent when accelerating core text routines to offscreen pixmaps, such
as composited windows.
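As an illustration only (the field layout is recalled from memory and the
helper is hypothetical, not the driver's actual emit path), the pitch must
be programmed in bytes:

#include <stdint.h>

/* Hypothetical helper assembling the BR13 dword of an XY_COLOR_BLT-style
 * command: the low 16 bits carry the destination pitch, and on gen4+ that
 * pitch is in bytes, not dwords.  Passing pitch >> 2 corrupts every row
 * after the first. */
static uint32_t xy_blt_br13(uint32_t pitch_in_bytes, uint32_t rop, int bpp)
{
    uint32_t br13 = pitch_in_bytes & 0xffff; /* bytes, NOT pitch >> 2 */
    br13 |= rop << 16;                       /* raster operation */
    if (bpp == 32)
        br13 |= 3 << 24;                     /* 32bpp colour depth */
    else if (bpp == 16)
        br13 |= 1 << 24;                     /* 16bpp (565) */
    return br13;
}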
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we test the area to be drawn against the existing CPU damage and find
it is already on the CPU, we may as well continue to utilize that
damaged region.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the region is busy on the GPU, or if we need to read the
destination, then we would incur penalties for trying to perform the
operation through the GTT. However, if we are simply streaming pixels to
an unbusy bo then we can do so inplace faster than computing the
corresponding GPU commands and uploading them.
Note: currently it is universally slower to use the GPU here (the
computation of the spans is too slow). However, that is only according
to micro-benchmarks; avoiding the readback is likely to be more
efficient in practice.
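Purely as an illustration of the heuristic (the names below are
hypothetical and not the driver's actual code path):

#include <stdbool.h>

/* Stream the spans in place through a GTT mapping only when doing so
 * cannot stall or force a readback. */
static bool prefer_inplace_spans(bool bo_is_busy, bool needs_dst_read)
{
    if (needs_dst_read)
        return false;   /* reads through a WC mapping are very slow */
    if (bo_is_busy)
        return false;   /* mapping would stall waiting for the GPU */
    return true;        /* write-only to an idle bo: go inplace */
}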
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the region is busy on the GPU, or if we need to read the
destination, then we would incur penalties for trying to perform the
operation through the GTT. However, if we are simply streaming pixels to
an unbusy bo then we can do so inplace faster than computing the
corresponding GPU commands and uploading them.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The damage layer was detecting that we were asking it to accumulate a
degenerate box emanating from PolySegment, as the unclipped paths made
the fatal assumption that they would not need to filter out degenerate
boxes. However, a degenerate line becomes a point; does the same apply
to a degenerate segment?
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
A few of the create_elts() routines missed marking the damage as dirty,
so that if only part of the embedded box was used (i.e. the damage
contained fewer than 8 rectangles that needed to be included in the
damage region) then those were being ignored during migration and testing.
Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
References: https://bugs.freedesktop.org/show_bug.cgi?id=44682
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the write operation fills the entire clip, then we can demote and
possibly avoid having to read back the clip from the GPU, provided that
we do not need the destination data for an arithmetic operation or mask.
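A rough sketch of the test, using pixman regions purely for illustration
(the helper below is hypothetical, not the driver's code):

#include <stdbool.h>
#include <pixman.h>

/* The readback can be skipped only when the write covers every pixel of
 * the clip and the raster-op/mask does not need the old destination. */
static bool can_skip_readback(pixman_region16_t *write,
                              pixman_region16_t *clip,
                              bool needs_dst) /* e.g. ROP or mask reads dst */
{
    pixman_region16_t uncovered;
    bool covered;

    if (needs_dst)
        return false;

    /* uncovered = clip - write; empty means the write fills the clip */
    pixman_region_init(&uncovered);
    pixman_region_subtract(&uncovered, clip, write);
    covered = !pixman_region_not_empty(&uncovered);
    pixman_region_fini(&uncovered);

    return covered;
}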
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The damage tracking code asserts that it only handles clip regions.
However, sna_copy_area() was failing to ensure that its damage region
was being clipped by the source drawable, leading to out of bounds reads
during forced fallback.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we decide to do the CPU fallback inplace on the GPU bo through a WC
mapping (because it is a large write-only operation), make sure that
the new GPU bo we create is not active and so will not^W^W is less likely
to cause a stall when mapped.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When reducing the damage we may find that it is actually empty and so
sna_damage_get_boxes() returns 0; be prepared.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cachelines will only be dirtied for the bytes accessed, so a better
metric would be based on the total number of pages brought into the TLB
and the total number of cachelines used. Base the decision on whether
to try and amalgamate the upload with others on the number of bytes
copied rather than the overall extents.
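For illustration, the metric amounts to something like the following
(hypothetical helper and types, not the driver code):

#include <stdint.h>

struct box { int16_t x1, y1, x2, y2; };

/* Sum the bytes actually touched by each box rather than measuring the
 * bounding extents of the whole upload. */
static uint64_t bytes_copied(const struct box *boxes, int n, int cpp)
{
    uint64_t bytes = 0;
    int i;

    for (i = 0; i < n; i++)
        bytes += (uint64_t)(boxes[i].x2 - boxes[i].x1) *
                 (boxes[i].y2 - boxes[i].y1) * cpp;

    return bytes;   /* compare against an amalgamation threshold */
}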
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
All of the asserts and debug options that led me to believe that the
tiling was completely screwy for some writes.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
search_linear_cache() was updated to track the first good match whilst it
continued to search for a better match. This resulted in the first good
bo being modified and the record of those modifications being lost, in
particular the change in tiling.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
FORCE_GPU_ONLY now has no effect beyond marking the initial pixmap
as all-damaged on the GPU, and so it no longer tests the paths for which
it was originally introduced.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Just in case we set a mode then fail to emit any dwords. Sounds
inefficient and woe betide the culprit when I find it...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We fudge forced use of the BLT ring unless we install a render backend,
and so we must also prevent the ring from being reset when the GPU is
idle. Therefore we make handling the ring status a backend function.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we do not have access to an accelerated render backend, only create
GPU buffers for the scanout and use an accelerated blitter for
upload/download and operating inplace on the scanout.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Found by valgrind:
==13639== Conditional jump or move depends on uninitialised value(s)
==13639== at 0x5520B1E: pixman_region_init_rects (in
/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.24.0)
==13639== by 0x89E6ED7: __sna_damage_reduce (sna_damage.c:489)
==13639== by 0x89E7FEC: _sna_damage_contains_box (sna_damage.c:1161)
==13639== by 0x89CFCD9: sna_drawable_use_gpu_bo (sna_damage.h:175)
==13639== by 0x89D52DA: sna_poly_segment (sna_accel.c:6130)
==13639== by 0x21F87E: damagePolySegment (damage.c:1096)
==13639== by 0x1565A2: ProcPolySegment (dispatch.c:1771)
==13639== by 0x159FB0: Dispatch (dispatch.c:437)
==13639== by 0x1491D9: main (main.c:287)
==13639== Uninitialised value was created by a heap allocation
==13639== at 0x4028693: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13639== by 0x89E6BFB: _sna_damage_create_boxes (sna_damage.c:205)
==13639== by 0x89E78F0: _sna_damage_add_rectangles (sna_damage.c:327)
==13639== by 0x89CD32D: sna_poly_fill_rect_blt.isra.65
(sna_damage.h:68)
==13639== by 0x89DE23F: sna_poly_fill_rect (sna_accel.c:8366)
==13639== by 0x21E9C8: damagePolyFillRect (damage.c:1309)
==13639== by 0x26DD3F: miPaintWindow (miexpose.c:674)
==13639== by 0x18370A: ChangeWindowAttributes (window.c:1553)
==13639== by 0x154500: ProcChangeWindowAttributes (dispatch.c:696)
==13639== by 0x159FB0: Dispatch (dispatch.c:437)
==13639== by 0x1491D9: main (main.c:287)
==13639==
Use 'count' everywhere for consistency.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Detected by valgrind:
==22012== Source and destination overlap in memcpy(0xd101000, 0xd101000,
783360)
==22012== at 0x402A180: memcpy (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22012== by 0x89BD4ED: memcpy_blt (blt.c:209)
==22012== by 0x89F2921: sna_write_boxes (sna_io.c:364)
==22012== by 0x89CFABF: sna_pixmap_move_to_gpu (sna_accel.c:1900)
==22012== by 0x89F49B0: sna_render_pixmap_bo (sna_render.c:571)
==22012== by 0x8A268CE: gen5_composite_picture (gen5_render.c:1908)
==22012== by 0x8A29B8A: gen5_render_composite (gen5_render.c:2252)
==22012== by 0x89E6762: sna_composite (sna_composite.c:485)
==22012== by 0x21D3C3: damageComposite (damage.c:569)
==22012== by 0x215963: ProcRenderComposite (render.c:728)
==22012== by 0x159FB0: Dispatch (dispatch.c:437)
==22012== by 0x1491D9: main (main.c:287)
==22012==
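Purely as an illustration (this is not the actual fix, which lies in the
caller), the degenerate case can be skipped before reaching memcpy():

#include <string.h>

/* Valgrind flagged a copy where src == dst, which is a pure no-op, so
 * skip it rather than hand overlapping ranges to memcpy(). */
static void copy_row(void *dst, const void *src, size_t len)
{
    if (dst == src || len == 0)
        return;     /* nothing to do; avoids undefined behaviour */
    memcpy(dst, src, len);
}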
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Instead of checking for CPU generation, use the libdrm-provided
I915_PARAM_HAS_LLC.
v2: use a define check to verify that we have I915_PARAM_HAS_LLC.
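A minimal sketch of the query, assuming libdrm's xf86drm.h and i915_drm.h
are on the include path (the helper name is hypothetical):

#include <string.h>
#include <xf86drm.h>    /* drmIoctl */
#include <i915_drm.h>   /* I915_PARAM_HAS_LLC, when new enough */

/* Ask the kernel whether the GPU shares the LLC, falling back to
 * "unknown" when libdrm predates the parameter. */
static int gpu_has_llc(int fd)
{
#ifdef I915_PARAM_HAS_LLC
    struct drm_i915_getparam gp;
    int has_llc = 0;

    memset(&gp, 0, sizeof(gp));
    gp.param = I915_PARAM_HAS_LLC;
    gp.value = &has_llc;
    if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp) == 0)
        return has_llc;
#endif
    return -1;  /* unknown: caller falls back to a generation check */
}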
Signed-off-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The previous commit undoes a premature optimisation that assumed that
the current damage captured all pixels written. However, it happens to
be a useful optimisation along that path (tracking the upload of partial
images), so add the necessary bookkeeping to watch for when the union
of CPU and GPU damage is no longer the complete set of all pixels
written; that is, if we migrate from one pixmap to the other, the
undamaged region goes untracked. We also take advantage of whenever we
damage the whole pixmap to restore the knowledge that our tracking of
all pixels written is complete.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we discard the CPU bo, we lose knowledge of whatever regions had been
initialised but were not dirty on the GPU, and must instead assume that
the entirety of the GPU bo is dirty.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Put some markers into the debug log as those functions create many
proxies causing a lot of debug noise.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
For SNB, in case you really, really want to use GPU detiling and not
incur the ring switch. Tweaking when to just mmap the target seems to
gain the most anyway...
The ulterior motive is that this provides fallback paths for avoiding
the use of TILING_Y with GTT mmaps, which is broken on 855gm.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We only want to create huge pwrite buffers when populating the inactive
cache for mmapped uploads. In the absence of using mmap for upload, be
more conservative with the alignment value so as not to simply waste
valuable aperture and memory.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If we steal a write buffer for creating a pixmap for readback, then we
need to be careful as we will have set the used amount to 0 and would
then incorrectly try to decrease it by the last row. Fortunately, we do
not yet have any code that attempts to create a 2D buffer for reading.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The batch may legitimately be submitted prior to the attachment of the
read buffer if, for example, we need to switch rings. Therefore update
the assertion to only check that the bo remains in existence via a
reference from either the exec or the user.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When backporting the patches from gen6, I didn't notice the memset that
came later, and this wasn't along the paths checked by rendercheck.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Whilst we can render to and blend with a depth-15 target, we cannot use
it as a texture with the sampling engine.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The hw wants it, as demonstrated by the '>' in KDE's menus. Why is it
always KDE that demonstrates coherency problems...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
One of the side-effects of emitting the composite state is that it
tags the destination surface as dirty as a result of the *forthcoming*
operation. Emitting the flush after emitting the composite state
therefore clears that tag, so we need to restore it for future coherency.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Native glamor pixmaps give much better performance than textured-drm
pixmaps, so this commit makes them the default when configured to use
glamor. Another advantage of this commit is that we reduce the risk of
encountering the "incompatible region exists for this name" error and
the associated render corruption. And since we now never intentionally
allocate a reusable pixmap, we can make all (intel_glamor) allocations
non-reusable without incurring too great an overhead.
A side effect is that those glamor pixmaps do not have a
valid BO attached to them and thus fail to get a DRI drawable. This
commit also fixes that problem by adjusting the fixup_shadow mechanism
to recreate a textured-drm pixmap from the native glamor pixmap. I tested
this with mutter, and it works fine.
The performance gain from this patch is about 10% to 20%, depending on
the workload.
Signed-off-by: Zhigang Gong <zhigang.gong@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
If the GPU and CPU caches are shared and coherent, we can use a cached
mapping for a linear bo in the CPU domain and so avoid the penalty of
using WC/UC mappings through the GTT (and any aperture pressure). We
presume that the bos for such mappings are indeed LLC cached...
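For illustration, a hypothetical helper (not the driver's kgem path) that
maps a bo through the regular CPU mmap ioctl and keeps it in the CPU
domain rather than going through the GTT:

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>

static void *map_bo_cached(int fd, uint32_t handle, uint64_t size)
{
    struct drm_i915_gem_mmap mmap_arg;
    struct drm_i915_gem_set_domain set_domain;

    memset(&mmap_arg, 0, sizeof(mmap_arg));
    mmap_arg.handle = handle;
    mmap_arg.size = size;
    if (drmIoctl(fd, DRM_IOCTL_I915_GEM_MMAP, &mmap_arg))
        return NULL;

    /* Keep reads/writes coherent by moving the bo to the CPU domain;
     * on LLC machines this costs nothing. */
    memset(&set_domain, 0, sizeof(set_domain));
    set_domain.handle = handle;
    set_domain.read_domains = I915_GEM_DOMAIN_CPU;
    set_domain.write_domain = I915_GEM_DOMAIN_CPU;
    drmIoctl(fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &set_domain);

    return (void *)(uintptr_t)mmap_arg.addr_ptr;
}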
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In how many different ways can we check that the scanout is allocated
before we start decoding video?
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>