If we reject the front buffer because it has too large a stride, repeat
the allocation using untiled for the cases where we can utilize laxer
hardware restrictions.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Avoid a potential use-after-free of the copied mode string by reusing
the converted kernel mode on resize.
==19897== Invalid read of size 8
==19897== at 0x661C330: ??? (strcpy.S:1308)
==19897== by 0x8618AE7: drmmode_set_mode_major (drmmode_display.c:293)
==19897== by 0x8618E6F: drmmode_xf86crtc_resize (drmmode_display.c:1299)
==19897== by 0x529A77: xf86RandR12ScreenSetSize (xf86RandR12.c:708)
==19897== by 0x4BD528: ProcRRSetScreenSize (rrscreen.c:301)
==19897== by 0x42B820: Dispatch (dispatch.c:432)
==19897== by 0x4254C9: main (main.c:289)
==19897== Address 0x72e91e0 is 0 bytes inside a block of size 9 free'd
==19897== at 0x4C23DBC: free (vg_replace_malloc.c:325)
==19897== by 0x48424F: xf86DeleteMode (xf86Mode.c:1921)
==19897== by 0x4942B7: xf86ProbeOutputModes (xf86Crtc.c:1572)
==19897== by 0x5290BB: xf86RandR12GetInfo12 (xf86RandR12.c:1551)
==19897== by 0x5313AE: RRGetInfo (rrinfo.c:202)
==19897== by 0x4BCCAA: rrGetScreenResources (rrscreen.c:337)
==19897== by 0x42B820: Dispatch (dispatch.c:432)
==19897== by 0x4254C9: main (main.c:289)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Unwind the array of Pixmaps already allocated and report failure for the
old dri GetBuffers() path.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
After splitting out the i810 driver into its own legacy directory, we
can identify the common routines not as i830 but as intel. This
clarifies the code which *is* i830 specific.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The driver is still built but is no longer under active development so
move it and supporting files to a new directory.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Tiling on gen 2/3 hardware is only supported for pitches up to 8192
bytes, so above this limit the surface will be untiled and we will no
longer have to comply with the power-of-two pitch alignment. So
disabling tiling for these too-wide surfaces should ~halve the memory
requirement for the full surface.
Also the absolute limit for the 2D blitter is 32,768 bytes. The
documentation says "up to 32,768 bytes" and my PineView box
malfunctioned with a surface stride of 32,768 so set the limit to be
32,767.
References:
Bug 28497 - Graphics corruption after opening a specific website
https://bugs.freedesktop.org/show_bug.cgi?id=28497
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
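The pitch rules described in the message above can be sketched roughly as
follows. This is an illustration only, not the driver's code: the 8192 and
32,767 byte limits come from the text, while choose_layout(), the 512-byte
minimum tiled pitch and the 64-byte linear alignment are assumptions made
for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_TILED_PITCH 8192u  /* gen 2/3 tiling limit quoted in the message */
#define MAX_BLT_PITCH   32767u /* blitter limit chosen in the message */
#define MIN_TILED_PITCH 512u   /* assumption: illustrative minimum only */
#define LINEAR_ALIGN    64u    /* assumption: linear pitch alignment */

struct surface_layout {
    bool ok;        /* false if the surface is too wide for the blitter */
    bool tiled;
    uint32_t pitch; /* bytes per row */
};

/* Round v up to the next power of two. */
static uint32_t next_pow2(uint32_t v)
{
    uint32_t p = 1;

    assert(v != 0 && v <= 0x80000000u);
    while (p < v)
        p <<= 1;
    return p;
}

/* Hypothetical helper: pick tiled or linear layout for a surface. */
static struct surface_layout choose_layout(uint32_t width, uint32_t cpp)
{
    struct surface_layout l = { false, false, 0 };
    uint32_t bytes = width * cpp;
    uint32_t pitch;

    /* Tiled surfaces need a power-of-two pitch on this hardware. */
    pitch = next_pow2(bytes > MIN_TILED_PITCH ? bytes : MIN_TILED_PITCH);
    if (pitch <= MAX_TILED_PITCH) {
        l.ok = true;
        l.tiled = true;
        l.pitch = pitch;
        return l;
    }

    /* Too wide to tile: go linear and drop the power-of-two rounding. */
    pitch = (bytes + LINEAR_ALIGN - 1) & ~(LINEAR_ALIGN - 1);
    if (pitch <= MAX_BLT_PITCH) {
        l.ok = true;
        l.pitch = pitch;
    }
    return l;
}
```

For a 2560-pixel wide, 4 bytes-per-pixel surface this picks a linear pitch of
10240 bytes, where power-of-two rounding would have needed 16384.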
Oops, I spent more time discussing these flushing bugs than I spent
paying attention to what I was actually doing.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This is a situation that should not be possible, need_mi_flush being
true but the list of pending flush pixmaps being clear. However, an
earlier bug in doing just that revealed this minor bug. So for
correctness, be careful not to clear need_mi_flush without emitting a
MI_FLUSH.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The key difference between i965 and earlier is that the surfaces are
passed to the samplers through an indirect table, and so the batch and
render target were not being marked dirty by the relocation (since the
relocation only happens within prepare_composite(), which may have been
in another batch). Simply call intel_pixmap_mark_dirty() when binding
the sampler table into the batch to ensure that the dirty state is
tracked appropriately.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As the batch submit may not trigger further drawing through flushing the
vertices, pass the requirement to emit the flush down to the submission
routine so that the flush can be appended after the final commands.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The presumption is that we wish to keep the target hot, so
copy to a new bo and move that to the CPU in preference to
causing ping-pong of the original.
Also the gpu is much faster at detiling.
Before (PineView):
400000 trep @ 0.1128 msec ( 8860.0/sec): GetImage 10x10 square
18000 trep @ 1.3839 msec ( 723.0/sec): GetImage 100x100 square
800 trep @ 30.0987 msec ( 33.2/sec): GetImage 500x500 square
After: (PineView)
180000 trep @ 0.1478 msec ( 6770.0/sec): GetImage 10x10 square
60000 trep @ 0.4545 msec ( 2200.0/sec): GetImage 100x100 square
4000 trep @ 8.0739 msec ( 124.0/sec): GetImage 500x500 square
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Use a single code path to upload the image data after selecting the
right bo, and take advantage of pwrite() when possible.
Fixes:
Bug 28569 - [i965] IGN's flash-based video player crashes X
https://bugs.freedesktop.org/show_bug.cgi?id=28569
Bug 28573 - [i965] Fullscreen flash and windowed SDL games fail to
update the screen
https://bugs.freedesktop.org/show_bug.cgi?id=28573
Reported-and-tested-by: Brian Rogers <brian@xyzw.org>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We should be able to eliminate these as the drawable remains unchanged.
However, the implicit flush of BUF_INFO fixes the rendering in KDE.
Alternatively, we need an MI_FLUSH | INHIBIT_RENDER_CACHE_FLUSH between
composites. (Note that it is not stale cache data causing the rendering
corruption and that a pipelined flush is not sufficient either.) Also,
having tried various points at which to flush, the only place where the
flush is effective seems to be between composite operations - that is a
flush after 2D is not sufficient.
Reported-by: Vasily Khoruzhick <anarsoul@gmail.com>
Reported-by: Clemens Eisserer <linuxhippy@gmail.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
After d41684d545 we now allocate all framebuffers as tiled bo, and so
we must be careful to use the appropriate stride as returned from the
allocation, instead of assuming that it is just an aligned width.
Fixes:
Bug 28461 - screen rotation results in corrupted output.
https://bugs.freedesktop.org/show_bug.cgi?id=28461
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-by: Till Matthiesen <entropy@everymail.net>
Fixes:
Bug 28446 - Garbled Font with Mathematica 7
https://bugs.freedesktop.org/show_bug.cgi?id=28446
Rewriting the glyphs to render to the destination directly and removing
the more expensive multiple invocations of CompositePicture per picture
was a great performance boost -- except that it needs special handling
in the backend in order to not fallback. Having done so for i915, I
neglected to ensure the sanity checking in i965_prepare_composite() was
sufficient. As it turns out, it was not and so we misrendered CA-glyphs
when rendering directly to the destination. This causes us to fallback
properly, but is a performance regression as we no longer try the 2-pass
magic helper before resorting to s/w. At the moment, I'd rather live
with the temporary regression and fix i965 to do the same magic as i915,
as it is critical to fixing the severe performance issues currently
crippling i965, as I believe that this regression only affects the
minority of applications (incorrect, as it turns out, as the glyphs are
overlapping) rendering directly to the destination.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Following a conversation with Owain G. Ainsworth, it was decided that
the second best approach to handling a wedged GPU was to hope that the
kernel could successfully reset it, which currently is only possible for
i965 and later chipsets.
The best approach is of course to prevent such hangs from ever occurring
in the first place.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
But emit the warning about rendering corruption every time for the
transient errors like out-of-memory.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
When I made libdrm stop overallocating so much memory for the purpose
of bo caching, things started scribbling on the bottom of my
frontbuffer (and vice versa, leading to GPU hangs). We had the usual
mistake of size = tiled_pitch * height instead of size = tiled_pitch *
tile_aligned_height.
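A minimal sketch of the size calculation described above. The helper names
and the tile_height parameter are hypothetical; the real tile height depends
on the hardware and tiling mode.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to a multiple of a. */
static uint32_t align_up(uint32_t v, uint32_t a)
{
    assert(a != 0);
    return (v + a - 1) / a * a;
}

/*
 * Size of a tiled buffer: the height must be rounded up to a whole
 * number of tile rows, otherwise the last partial row of tiles lies
 * beyond the end of the allocation.
 */
static uint32_t tiled_bo_size(uint32_t tiled_pitch, uint32_t height,
                              uint32_t tile_height)
{
    return tiled_pitch * align_up(height, tile_height);
}
```

With a pitch of 4096 bytes, a height of 601 rows and an assumed tile height of
8 rows, the naive pitch * height comes out 28672 bytes short of the
allocation the tiles actually cover.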
We need to install the acceleration functions so that they are wrapped
by the Damage layer. This fixes the corruption under a compositing WM
introduced in commit 8700673157.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-and-tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
A trivial change, I thought, having tested it before rebasing, unworthy
even of a perfunctory compile test. How wrong I was.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
The i915 textured video routines know how to handle drawing on an output
larger than the 3D pipe, so allow them to do so.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
gcc is horribly bad at collapsing the constants:
   text    data     bss     dec     hex filename
 282336    8720     256  291312   471f0 intel_drv.so.old
 269280    8720     256  278256   43ef0 intel_drv.so
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Use centre sampling of textures to match pixman, and remove numerous
off-by-one and visual artefacts when rendering. The classic example for
this is cairo/text/xcomposite-projection where the edge of the rotated
rectangle is jaggy due to the incorrect sample position.
Fixes:
Bug 16917 - [i915] Blur on y-axis also when only x-axis is scaled
billiear
https://bugs.freedesktop.org/show_bug.cgi?id=16917
And about 15 tests from the Cairo test suite.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
3DSTATE_BUF_INFO is an implicit flush of the pipeline, so avoid emitting
that and associated state unless the destination pixmap has actually
changed. This is a win of around 3-5% for cairo-perf-trace, notably for
firefox.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris's new buffer exchange check is a good one, but we don't want to
hit the immediate blit fallback path if it fails. We still want to
schedule a blit for sometime in the future, and we need to use it
wherever an exchange might occur (like the secondary flip check or the
currently disabled CanExchange check).
Fixes https://bugs.freedesktop.org/show_bug.cgi?id=28252.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This is wildly optimistic, but it should work in a surprising number of
error situations and some output in those cases will hopefully be
better than none...
If we submit a batchbuffer and the kernel reports the GPU is hung (which
will be caused by an earlier execbuffer, and so the kernel should have
had enough time to determine whether or not it could reset the GPU) then
disable any further attempt to accelerate gfx and force fallbacks to map
the buffers and use the CPU. We cannot normally map any more buffers if
the GPU is hung, so only those already mapped prior to the hang can be
written to, or those allocated in system memory. However, we can expect
that the framebuffer is already mapped, and so have a reasonable
expectation to continue to see the display update.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
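The fallback policy above might look something like the sketch below. The
struct and function names are hypothetical, and treating -EIO as the "GPU is
hung" report from the kernel is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct accel_state {
    bool wedged; /* set once the kernel reports a hung GPU */
};

/* ret is the result of a batch submission: 0 or a negative errno. */
static void note_submit_result(struct accel_state *s, int ret)
{
    assert(ret <= 0);
    /* Assumption: a hung GPU is reported as -EIO. Other errors, such
     * as out-of-memory, are transient and leave acceleration enabled. */
    if (ret == -EIO)
        s->wedged = true;
}

/* Every acceleration hook would check this first and fall back to the
 * CPU path when it returns false. */
static bool can_accelerate(const struct accel_state *s)
{
    return !s->wedged;
}
```

The flag is sticky in this sketch: once set, later successful submissions do
not re-enable acceleration.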
Rewrite glyph rendering to avoid the intermediate buffer, accumulating
the glyph rectangles directly in the backend composite routines. And
modify the glyph cache routines to fully utilise the allocated size of
the tiled buffer on older hardware. To do this we alias all glyph sizes
into the same texture using a technique suggested by Keith Packard.
PineView:
885/856-> 1150/1110 kglyph/s (aa/rgb)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Handle rendering textured video onto an extended desktop (>2048) by
using a temporary pixmap. Note that we still cannot handle rendering to
a greater than 2048 destination region, for that we will need to tile.
Hmm, time to request a 2560x1600, 10bpc monitor...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
17:53 < arekm> ickle: i830_dri.c:630:28: error: ‘DrawableRec’ has no member named ‘bpp’
17:53 < arekm> ickle: i830_dri.c:630:57: error: ‘DrawableRec’ has no member named ‘bpp’
* sigh. I need to fix this machine to have the right version of the
* headers.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
An unredirected window (thanks Michel for the reminder) is backed by the
Screen pixmap, and so uses a reference of that as its front buffer. The
back buffer is a pixmap appropriately sized for the drawable. When the
application requests to swap its buffers, obviously we cannot simply
exchange the front and back buffer as they do not match, but need to copy
the appropriate region from the back to the front.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This reverts commit 44d45d3fa5.
Michel Dänzer pointed out the flaw in using the pixmap size instead of
the drawable size:
Using the backing pixmap dimensions for this is not desirable. In
particular, it means that the DRI2 buffers of non-redirected windows
always have the same size as the screen. But even for redirected windows
it wastes some graphics memory with a re-parenting window manager, that
is if it doesn't break in various ways due to the top left corner of the
DRI2 buffers no longer corresponding to the top left corner of the window.
This avoids using the garbage values stored in the Screen drawable,
instead of the true values which are only maintained in its backing
pixmap. The consequence of using the wrong size was to hand a 1x1
pixmap to metacity/mutter and have it believe it was a full screen
drawable; GPU hangs ensued if using page flipping.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
==7596== Invalid write of size 4
==7596== at 0x491ACA8: intel_batch_teardown (i830_batchbuffer.c:118)
==7596== by 0x491C9D6: I830CloseScreen (i830_driver.c:1419)
==7596== by 0x8103A9C: RRCloseScreen (randr.c:105)
==7596== by 0x80DE794: xf86CrtcCloseScreen (xf86Crtc.c:759)
==7596== by 0x80BEBA3: DGACloseScreen (xf86DGA.c:268)
==7596== by 0x80D044B: DPMSClose (xf86DPMS.c:134)
==7596== by 0x488B050: XvCloseScreen (xvmain.c:320)
==7596== by 0x81841B1: VidModeClose (xf86VidMode.c:110)
==7596== by 0x80EB12F: CursorCloseScreen (cursor.c:191)
==7596== by 0x810CA17: AnimCurCloseScreen (animcur.c:108)
==7596== by 0x816937E: compCloseScreen (compinit.c:86)
==7596== by 0x48D39B9: glxCloseScreen (glxscreens.c:221)
==7596== Address 0x49c1a50 is 24 bytes inside a block of size 52 free'd
==7596== at 0x4024866: free (vg_replace_malloc.c:325)
==7596== by 0x80B023C: Xfree (utils.c:1096)
==7596== by 0x4927CFD: i830_set_pixmap_bo (i830_uxa.c:647)
==7596== by 0x491C9B4: I830CloseScreen (i830_driver.c:1413)
==7596== by 0x8103A9C: RRCloseScreen (randr.c:105)
==7596== by 0x80DE794: xf86CrtcCloseScreen (xf86Crtc.c:759)
==7596== by 0x80BEBA3: DGACloseScreen (xf86DGA.c:268)
==7596== by 0x80D044B: DPMSClose (xf86DPMS.c:134)
==7596== by 0x488B050: XvCloseScreen (xvmain.c:320)
==7596== by 0x81841B1: VidModeClose (xf86VidMode.c:110)
==7596== by 0x80EB12F: CursorCloseScreen (cursor.c:191)
==7596== by 0x810CA17: AnimCurCloseScreen (animcur.c:108)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
If the destination cannot fit into the 3D pipeline when we need to
composite, we fallback to doing the operation on the CPU. This is very
slow, and quite easy to trigger on i915 by plugging in an external
display.
An alternative is to extract the extents of the operation from the
destination using the blitter which can usually handle much larger
operations. This gives us a temporary target that can fit into the 3D
pipeline and thus be accelerated, before copying back into the larger
real destination.
For x11perf this boosts glyph rendering on PineView, from 38kglyphs/s to
480kglyphs/s. Just a little shy of the native performance of 601kglyphs/s
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In theory this should allow us to pack far more operations into a single
batch buffer, and reduce our overheads.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
By using pwrite() instead of dri_bo_map() we can write to the batch buffer
through the GTT and not be forced to map it back into the CPU domain and
out again, eliminating a double clflush.
Measuring x11perf text performance on PineView:
Before:
16000000 trep @ 0.0020 msec (511000.0/sec): Char in 80-char aa line (Charter 10)
16000000 trep @ 0.0021 msec (480000.0/sec): Char in 80-char rgb line (Charter 10)
After:
16000000 trep @ 0.0019 msec (532000.0/sec): Char in 80-char aa line (Charter 10)
16000000 trep @ 0.0020 msec (496000.0/sec): Char in 80-char rgb line (Charter 10)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
On my PineView box these represent ~5% overhead on x11perf text:
Before:
16000000 trep @ 0.0020 msec (495000.0/sec): Char in 80-char aa line (Charter 10)
12000000 trep @ 0.0022 msec (461000.0/sec): Char in 80-char rgb line (Charter 10)
After:
16000000 trep @ 0.0020 msec (511000.0/sec): Char in 80-char aa line (Charter 10)
16000000 trep @ 0.0021 msec (480000.0/sec): Char in 80-char rgb line (Charter 10)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Combine all the calls to composite between prepare_composite and
done_composite into a single primitive list, rather than a primitive
call per composite().
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
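As a rough illustration of the batching described above, the sketch below
accumulates vertices for every composite() call and only counts one
primitive when done_composite() runs. The structure, the fixed-size vertex
array and the three-vertices-per-rectangle layout are assumptions for the
example, not the driver's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

#define VERTS_PER_RECT 3 /* assumption: a rectangle is given by 3 corners */
#define MAX_RECTS 64     /* assumption: arbitrary capacity for the sketch */

struct composite_state {
    float verts[MAX_RECTS * VERTS_PER_RECT * 2]; /* x, y per vertex */
    size_t nverts;
    unsigned prims_emitted; /* stands in for primitive commands emitted */
};

/* Emit everything accumulated so far as one primitive list. */
static void flush_rects(struct composite_state *s)
{
    if (s->nverts == 0)
        return;
    s->prims_emitted++;
    s->nverts = 0;
}

static void prepare_composite(struct composite_state *s)
{
    s->nverts = 0;
}

/* Append one rectangle; nothing is emitted unless the array is full. */
static void composite(struct composite_state *s,
                      float x, float y, float w, float h)
{
    float *v;

    assert(w >= 0 && h >= 0);
    if (s->nverts == MAX_RECTS * VERTS_PER_RECT)
        flush_rects(s);

    v = &s->verts[s->nverts * 2];
    v[0] = x + w; v[1] = y + h; /* bottom-right */
    v[2] = x;     v[3] = y + h; /* bottom-left */
    v[4] = x;     v[5] = y;     /* top-left */
    s->nverts += VERTS_PER_RECT;
}

static void done_composite(struct composite_state *s)
{
    flush_rects(s);
}
```

Three composite() calls between prepare and done then cost one primitive
command rather than three.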
References:
Bug 28135 - [855GM] Slowdown/High CPU-Usage after Git-Commit 926fbc7d90
https://bugs.freedesktop.org/show_bug.cgi?id=28135
The simple answer is that I had assumed that 0 was a reserved value.
However, without the bpp encoded into the format, 0 was used for a8r8g8b8
and r5g6b5, which are very common formats!
The other possibility for the slowdown is that gtkperf is using one of the
now verboten xrgb formats -- but would in fact be valid if the source
covers the clip and we could fixup the alpha value in the fixed function
combine.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
All textures are now properly declared so that the alpha swizzling
occurs in the sampler or not at all. The downside is that for quite a
few composite operations we have to fallback to software on older
hardware. There is more scope for performing the alpha expansion in
shaders or combiners when we know the picture covers the clip - which is
almost all of the time for normal operations especially those
constructed by Cairo.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We no longer workaround the lack of alpha expansion for xrgb textures as
this interferes with EXTEND_NONE, though we could if we know the source
covers the clip...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Allow us to check whether we can handle the operation using the blitter
prior to doing any work.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
I was blindly fixing rendercheck without thinking. We need to force the
alpha value to be in the blend unit and not before -- otherwise we
generate the incorrect result whilst blending. D'oh.
GEM handles serialisation of the new front buffer with respect to page
flipping and rendering and reports back when the flip is complete.
Adding a sync point here is then redundant.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we schedule swaps for some time in the future and may process a
detachment prior to receiving the vblank notification from the kernel,
we need to hold a reference to the buffers for our swap event handler.
Fixes:
Bug 28080 - "glresize" causes X server segfault with indirect rendering.
https://bugs.freedesktop.org/show_bug.cgi?id=28080
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Ensure that garbage is not stored in the unused alpha channel so that
we can rely on it being currently initialised when used as a source or
returning via GetImage.
Partial fix for rendercheck -t blend
1. Instead of swapping bos, swap the entire private structure.
2. If we update the pixmap bo for the Screen, make sure we update the
reference inside intel->front_buffer so that xrandr still functions.
Fixes:
Bug 27922 - i965: Rapidly resizing OpenGL window causes GPU to hang.
https://bugs.freedesktop.org/show_bug.cgi?id=27922
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
We need to prevent overcommitting the aperture, and in particular if we
allocate a buffer larger than available space we will fail to mmap it in
and rendering will fail. Trying to allocate multiple large buffers in
the aperture, often the case when falling back, causes thrashing and
eviction of useful buffers. So from the outset simply do not allocate a
bo if the required size is more than half the available aperture
space.
Fixes allocation failure in ocitymap.trace for instance.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
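The allocation policy above reduces to a single comparison. A minimal
sketch, with a hypothetical function name and with the available aperture
size passed in rather than queried from the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Refuse to back a pixmap with a bo when it would need more than half
 * of the available aperture; the caller then keeps the pixmap in
 * system memory instead.
 */
static bool bo_size_allowed(size_t size, size_t aperture_available)
{
    assert(size > 0);
    return size <= aperture_available / 2;
}
```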
The pitch needs to be set on the pixmap prior to the private
intel_pixmap structure being created so that it can record the correct
value from the pixmap.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we can not accelerate these either as a destination or a source,
don't bother allocating a buffer object for them.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
x11perf has a regression
https://bugs.freedesktop.org/show_bug.cgi?id=25068
caused by
commit e581ceb738
i915: Use the color channels to pass along solid sources and masks.
Do not convert 1x1R pixmaps into a solid color as the readback from the
bo negates all the performance advantages of using a smaller vertex
buffer and fewer samplers.
Before (PineView):
aa=66800 glyphs/s, rgb=28800 glyphs/s
Now:
aa=96800 glyphs/s, rgb=48500 glyphs/s
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
x11perf regression caused by 2D driver
https://bugs.freedesktop.org/show_bug.cgi?id=28047
caused by
commit a7b800513f
uxa: Extract sub-region from in-memory buffers.
The issue is that as we extract the region prior to checking whether the
composite can in fact be accelerated, we perform expensive surplus
operations. This is particularly noticeable for ComponentAlpha text,
such as rgb10text. The solution here is to rearrange the
check_composite() prior to acquiring the sources, and only extracting
the subregion if the render path can not actually handle the texture.
Performance (on PineView):
a7b800513^: aa=68600 glyphs/s, rgb=29900 glyphs/s
a7b800513: aa=65700 glyphs/s, rgb=13200 glyphs/s
now: aa=66800 glyphs/s, rgb=28800 glyphs/s
The residual lossage seems to be from the extra function call and
dixPrivate lookups. Hmm. More worrying is the extremely low performance,
however the results are consistent so the improvement looks real...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Complete the prepare access for the PutImage fallback via fbCopyArea(),
by remembering to set the private pointer to the GTT mapping.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
On older versions of pixman, pixman_blt() can return false if the images
are <= 8bpp. If we are being called from CopyArea, then we cannot return
FALSE here as that will trigger an infinite recursion. Instead we must
manually perform the fallback using fbCopyArea().
Reported-by: Peter Clifton <pcjc2@cam.ac.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This is some fallout from my xvmc cleanup.
Original-Patch-by: Rico Tzschichholz <ricotz@t-online.de>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
When we need to allocate a new bo for use as a gpu target, first check
if we can reuse a pixmap that has already been relocated into the
aperture as a temporary target, for instance a glyph mask or a clip mask.
Before:
backend test min(s) median(s) stddev.
xlib firefox-planet-gnome 50.568 50.873 0.30%
xcb firefox-planet-gnome 49.686 53.003 3.92%
xlib evolution 40.115 40.131 0.86%
xcb evolution 28.241 28.285 0.18%
After:
backend test min(s) median(s) stddev.
xlib firefox-planet-gnome 47.759 48.233 0.80%
xcb firefox-planet-gnome 48.611 48.657 0.87%
xlib evolution 38.954 38.991 0.05%
xcb evolution 26.561 26.654 0.19%
And even more dramatic improvements when using a font size larger than
the maximum size of the glyph cache:
xcb firefox-36-20090611: 1.79x speedup
xlib firefox-36-20090611: 1.74x speedup
xcb firefox-36-20090609: 1.62x speedup
xlib firefox-36-20090609: 1.59x speedup
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
As we only use the glyph cache for small glyphs, those larger than 32x32
will first be copied to a bo and used as a mask in a composite
operation. We can avoid the allocation and upload per use by allocating
a bo for the over-sized glyph from the start. As the glyph is large
anyway, the excess memory allocation is less significant.
Using normal font sizes, firefox shows no change - as expected. However,
using the 36 font size traces, we see around a 10% improvement on g45.
Before:
xcb firefox-36-20090609 127.333 127.897 0.22%
xcb firefox-36-20090611 87.456 88.624 0.66%
xcb firefox-20090601 19.522 20.194 1.69%
xlib firefox-36-20090609 201.054 201.780 0.18%
xlib firefox-36-20090611 133.468 133.717 0.09%
xlib firefox-20090601 23.740 23.975 0.49%
With large glyphs in bo:
xcb firefox-36-20090609 117.256 118.254 0.42%
xcb firefox-36-20090611 79.462 79.962 0.31%
xcb firefox-20090601 19.658 20.024 0.92%
xlib firefox-36-20090609 185.645 188.202 0.68%
xlib firefox-36-20090611 123.592 124.940 0.54%
xlib firefox-20090601 23.917 24.098 0.38%
Thanks to Owain G. Ainsworth for the suggestion!
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
In order to avoid an infinite recursion after enabling CopyArea to use
the put_image acceleration to either stream a blit or to copy in-place,
we cannot call CopyArea from put_image for the fallback path. Instead,
we can simply call pixman_blt directly, which coincidentally is a tiny
bit faster.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This slightly improves xrender performance on fence reg starved
i8xx hw.
I've also changed a few function calls to the new names from the
compat ones while looking at the code.
The i915 textured video path is not converted because atm the xv
code does not use tiled surfaces.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
We appear to have a confusion of stride in terms of pixels, pitch in
terms of bytes and the actual width of the surface.
i830_pad_drawable_width() appears to be aligning *pixels* to a
64 pixel boundary and has never used the chars-per-pixel, causing
considerable confusion in its callers. Remove the parameter and ensure
that the callers are expecting a value in pixels returned, multiplying
by cpp where necessary to get the pitch.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
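To make the pixel/byte distinction above concrete, here is a small sketch.
The helper names are hypothetical stand-ins for illustration; only the
64-pixel alignment and the "multiply by cpp to get the pitch" rule are taken
from the message.

```c
#include <assert.h>

/* Width padded to a 64 pixel boundary; the result is in pixels. */
static int pad_width_pixels(int width)
{
    assert(width >= 0);
    return (width + 63) & ~63;
}

/* Pitch in bytes: the padded width in pixels times bytes per pixel. */
static int pitch_bytes(int width, int cpp)
{
    assert(cpp > 0);
    return pad_width_pixels(width) * cpp;
}
```

A 1366-pixel wide drawable pads to 1408 pixels, which at 4 bytes per pixel
is a pitch of 5632 bytes.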
Caught by a malloc library assert.
Note to self: Don't just copy&paste codelines around :(
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=27540
Tested-by: Nick Bowler <nbowler@draconx.ca>
Tested-by: Calvin Walton <calvin.walton@gmail.com>
For some reason I've made a mess out of the overlay stride constraints.
Fix it up.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Tested-by: Calvin Walton <calvin.walton@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=27453
In my recent fix for the chroma pitch for i915 xvmc I've forgotten about
i965 class hw. For videos with a non-even sized stride (measured in dwords)
the chroma pitch was internally inconsistent and one dword off.
Fix this by using pitch2 for the chroma pitch in i965 textured video like
everywhere else.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=27417
Tested-by: Nick Bowler <nbowler@draconx.ca>
Tested-by: Sven Arvidsson <sa@whiz.se>
Simply store the desired bo size in intel_xvmc_context and initialize
it in the driver's create_context function.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
... by putting struct intel_xvmc_surface at the beginning. Also kill
the common context handling code and simply keep a pointer in the
surface private to the context.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
It's unused. Also drop all related generic code that tries to do
clever stuff with this callback. These are all remnants from a
pre-gem world.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
All of these are also stored in the context. Also kill the context
reference counting. Doesn't serve a purpose besides occupying a
pointer to the context in the private surface struct.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
... by putting struct intel_xvmc_surface at the beginning. This
will allow to consolidate surface and bo handling.
Also kill some now dead code used to handle the common surface
structure.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
We only passed around and actually used the gem handle. Don't
need a struct for one field alone ...
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
And kill all the static structures. This way it's clearer what's
common and what's specific. And the code is shorter too.
Also clean up src/i830_hwmc.c - kill the nonstandard surface types
for i915 and the associated code.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Doing the same with the i965 code will allow us to share the
create_context function.
src/i915_hwmc.h is now almost empty. Move the last #defines to
src/xvmc/i915_xvmc.c where they are actually used and delete the
file.
Also rename the ddx context struct to something sane.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Like for the subpicture stuff, share the "do-nothing" functions ...
And fix function name spelling, too.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Like for i915. Also drop that now totally superfluous limit on the
available surfaces.
Move the surface struct into the userspace library header now that
the ddx doesn't use it anymore.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
The XvMC driver api in the server is insane. Even for optional stuff
like subpicture support it doesn't check for NULL-pointers. So we
have to retain some dummy functions.
Wonder how many copies of these things exist on fdo ...
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Both xvmc are handing in the bo in the exact same way. So move the code
to src/i830_video.c and kill this great oeuvre of spaghetti-code.
The xvmc driver ini and fini also lost their last use, kill them, too.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
After unifying i915 and i965, not much will be left of these files.
Therefore merge them to make the following changes easier.
This creates some warnings about some redefined macros, but when this
is all cleaned up they'll all be gone.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Pauli pointed out that we take a ref on the front buffer when exchanging
but forget to release it. The ref is necessary since the set functions
will drop refs as necessary, but once we set the front buffer to point
at the back pixmap, we need to release our private ref again, or we'll
leak buffers.
Reported-by: Pauli Nieminen <suokkos@gmail.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
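The reference handling described above can be modelled with a toy reference
count. Everything here (struct buf, set_slot(), exchange()) is a
hypothetical model for illustration, not the driver's buffer API.

```c
#include <assert.h>
#include <stddef.h>

struct buf {
    int refcnt;
};

static struct buf *buf_ref(struct buf *b)
{
    b->refcnt++;
    return b;
}

static void buf_unref(struct buf *b)
{
    assert(b->refcnt > 0);
    b->refcnt--; /* a real implementation would free at zero */
}

/* Point a slot at a new buffer, dropping the slot's old reference. */
static void set_slot(struct buf **slot, struct buf *new_buf)
{
    buf_ref(new_buf);
    if (*slot != NULL)
        buf_unref(*slot);
    *slot = new_buf;
}

static void exchange(struct buf **front, struct buf **back)
{
    /* Private ref so the old front survives the first set_slot(). */
    struct buf *old_front = buf_ref(*front);

    set_slot(front, *back);
    set_slot(back, old_front);

    /* Without this release the old front buffer is leaked. */
    buf_unref(old_front);
}
```

After exchange() each buffer ends with the same count it started with;
omitting the final buf_unref() leaves one extra reference on the old front
buffer on every swap.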
After reports of segmentation faults caused by
d6b7f96fde and vmware, the most obvious
cause would be illegally writing to the src data when performing the alpha
fill inline. So force the image upload to go via a fresh buffer whenever
we need to modify the incoming data.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
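A minimal sketch of the "fresh buffer" approach described above: the
incoming pixels are never written to, and the alpha fill happens while
copying. The function name is hypothetical, and filling alpha with 0xff for
32-bit xrgb pixels is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Copy n 32-bit pixels into a newly allocated buffer, forcing the
 * alpha byte to 0xff. The source is const and is left untouched.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static uint32_t *copy_with_opaque_alpha(const uint32_t *src, size_t n)
{
    uint32_t *dst;
    size_t i;

    assert(src != NULL || n == 0);
    dst = malloc(n ? n * sizeof(*dst) : 1);
    if (dst == NULL)
        return NULL;

    for (i = 0; i < n; i++)
        dst[i] = src[i] | 0xff000000u;
    return dst;
}
```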
On memory constrained hardware, tiling is vital for good performance as
it minimizes cache misses. The downside is that older hardware
(which often suffers from the lack of bandwidth) requires the use of
fences for many operations, which are in short supply and so may cause
shorter batchbuffers. However our batch buffers are typically short and
so this is unlikely to be a concern and should not affect the performance wins.
A quick bit of testing suggests the effect is inconclusive on
firefox/i945:
                 linear    tiled
xcb             205.470  206.219
xcb-render-0.0  404.704  388.413
xlib            166.410  170.805
A secondary effect of the patch is to workaround a G31 specific hang
when attempting to use linear 2048x2048 surfaces. Bonus!
Fixes:
Bug 25375 - Performance issue using texture from pixmap (tfp) glx extension on 945
http://bugs.freedesktop.org/show_bug.cgi?id=25375
Bug 27100 - GPU Hung copying a 2048x1152 pixmap
http://bugs.freedesktop.org/show_bug.cgi?id=27100
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: John <jvinla@gmail.com>
Otherwise it would be a random value and drmmode_page_flip_handler()
won't have a chance to call I830DRI2FlipEventHandler() and indicate
a full page flip is complete.
Signed-off-by: Li Peng <peng.li@intel.com>
Fixes:
http://bugs.freedesktop.org/show_bug.cgi?id=27123
Fatal server error:
i915_emit_composite_setup: ADVANCE_BATCH: under-used allocation 100/104
Introduced with commit d6b7f96fde.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Do not try to fixup the alpha in the ff/shaders as this has the
side-effect of overriding the alpha value of the border color, causing
images to be padded with black rather than transparent. This can
generate large and obnoxious visual artefacts.
Fixes:
Bug 17933 - x8r8g8b8 doesn't sample alpha=0 outside surface bounds
http://bugs.freedesktop.org/show_bug.cgi?id=17933
and many related cairo test suite failures.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Fixes a number of cairo test suite failures.
Also affects:
Bug 16917 - Blur on y-axis also when only x-axis is scaled bilinear
http://bugs.freedesktop.org/show_bug.cgi?id=16917
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
My cleanup accidentally created an inconsistency in the YUV plane ordering.
I think we can safely assume that I'm colorblind ;)
As Carl Worth rightly pointed out, this change deserves a more elaborate
explanation:
For Xv planar formats, the three planes are stored consecutively in
memory, ordered Y U V. Now for some totally odd reason (= none at all),
i915 xvmc stored it in Y V U order. Right after the release of 2.10, with
commit "Xv: consolidate xmvc passthrough handling" I've inadvertently
broken xvmc support (which started this whole odyssey into xvmc). When
fixing stuff up, I neglected this special plane ordering and simply
assumed it to be the same as Xv and dropped that special case for i915 in
src/i830_video.c. This patch completes the change to standard YUV plane
ordering by making the corresponding change in src/xvmc/i915_xvmc.c.
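
For illustration, the standard Y U V layout amounts to the following
offset calculation. This is a sketch under assumptions: a 4:2:0 format
with half-height chroma planes and no extra padding between planes; the
struct and function names are made up and do not appear in the driver.

```c
#include <stddef.h>

/* Hypothetical result type: byte offsets of each plane in the buffer. */
struct plane_offsets {
    size_t y;
    size_t u;
    size_t v;
};

/*
 * Planes are stored consecutively in Y, U, V order. Assumes 4:2:0
 * subsampling, so each chroma plane has half as many rows as the luma
 * plane (rounded up).
 */
static struct plane_offsets yuv420_offsets(size_t y_pitch, size_t uv_pitch,
                                           size_t height)
{
    struct plane_offsets o;
    size_t chroma_rows = (height + 1) / 2;

    o.y = 0;
    o.u = y_pitch * height;
    o.v = o.u + uv_pitch * chroma_rows;
    return o;
}
```

The old i915 xvmc Y V U ordering would simply swap the roles of the last
two offsets, which is why both sides have to agree on one ordering.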
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Just make it mirror ScheduleSwap: complete the wait on any error
condition so as not to crash the client if the kernel is misbehaving.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
We can only handle 32 bit values unless we totally virtualize the count,
since the kernel only handles 32 bits itself. Rather than adding all
that overhead, just tolerate the occasional missed event every time the
counter runs over.
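
A minimal sketch of what the truncation means, assuming the server hands
the DDX a 64-bit count (the helper name is hypothetical, not a driver
function):

```c
#include <stdint.h>

/*
 * Hypothetical helper: reduce a 64-bit MSC to the 32-bit sequence value
 * the kernel can handle. The upper 32 bits are dropped, so counts that
 * differ only above bit 31 become indistinguishable; that is the source
 * of the occasional missed event at wraparound.
 */
static uint32_t msc_to_kernel_sequence(uint64_t msc)
{
    return (uint32_t)(msc & 0xffffffffu);
}
```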
Reported-by: Mario Kleiner <mario.kleiner@tuebingen.mpg.de>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
A couple more niggles: make sure we return a target_msc that at least
matches the current count; this is a little more friendly to clients
that missed an event. Also check for >= when calculating the remainder
so we'll catch the *next* vblank event when the calculation is
satisfied, rather than the current one as might happen at times.
Reported-by: Mario Kleiner <mario.kleiner@tuebingen.mpg.de>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
My merge of Mario's patch for this was botched. Fix it up so that OML
waits work correctly, and remove a bogus warning from ScheduleSwap.
Reported-by: Mario Kleiner <mario.kleiner@tuebingen.mpg.de>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The current code in I830DRI2ScheduleSwap() only schedules the correct
vblank events for the case divisor == 0, i.e., the simple
glXSwapBuffers() case.
In a glXSwapBuffersMscOML() request, divisor can be > 0, which would go
wrong.
This modified code should handle target_msc, divisor, remainder and the
different cases defined in the OML_sync_control extension correctly for
the divisor > 0 case.
It also tries to make sure that the effective framecount of swap
satisfies all constraints, taking the 1 frame delay in pageflipping mode
and possible delays in blitting/exchange mode due to
DRM_VBLANK_NEXTONMISS into account.
The swap_interval logic in the X server's DRI2SwapBuffers() call expects
the swap_target returned from the DDX to be reasonably accurate;
otherwise the implementation of swap_interval for glXSwapBuffers(), as
defined in the SGI_swap_interval extension, may become unreliable.
For non-pageflipped mode, the returned swap_target is always correct due
to the adjustments done by drmWaitVBlank(), as DRM_VBLANK_NEXTONMISS is
set.
In pageflipped mode, DRM_VBLANK_NEXTONMISS can't be used without severe
impact on performance, so the code in I830DRI2ScheduleSwap() must make
manual adjustments to the returned vbl.reply.sequence number.
This patch adds the needed adjustments.
Signed-off-by: Mario Kleiner <mario.kleiner@tuebingen.mpg.de>
Previous code only handled divisor == 0 case correctly. This should
honor a given target_msc for the divisor > 0 case and handle the
(msc % divisor) == remainder constraint correctly.
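
The intended selection of the target count can be sketched roughly as
follows. This is an illustration of the constraint described above, not
the code in I830DRI2ScheduleSwap(); the function name is hypothetical
and the pageflip and DRM_VBLANK_NEXTONMISS adjustments are left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the MSC at which a swap should happen.
 *  - If target_msc has not been reached yet (or there is no divisor),
 *    honor target_msc, but never return less than the current count.
 *  - Otherwise return the next MSC strictly after current_msc that
 *    satisfies (msc % divisor) == remainder.
 */
static uint64_t next_swap_msc(uint64_t current_msc, uint64_t target_msc,
                              uint64_t divisor, uint64_t remainder)
{
    if (divisor == 0 || current_msc < target_msc)
        return target_msc > current_msc ? target_msc : current_msc;

    assert(remainder < divisor);

    /* Start of the current divisor period plus the wanted remainder. */
    uint64_t msc = current_msc - (current_msc % divisor) + remainder;

    /* Already at or past that point: take the same slot next period. */
    if (msc <= current_msc)
        msc += divisor;

    return msc;
}
```

For example, with current_msc = 10, target_msc = 5, divisor = 4 and
remainder = 2, the current count already satisfies the constraint, so
the sketch picks the following match at 14.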
Signed-off-by: Mario Kleiner <mario.kleiner@tuebingen.mpg.de>
If a drawable isn't visible due to DPMS or redirection, we'll just blit
it rather than schedule a swap event. However, we didn't reset the
target_msc, so the swap target we receive from the server could get out
of sync with the vblank count of the drawable's display. So at DPMS on
time, the swap target would be the last good vblank count plus some
large number (since the swaps won't have been throttled).
Solve this by zeroing out the swap target like we should when we fall
back to a blit. Also make the kernel error cases more friendly by
making them fall back to blits too.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Once we hit this error it's unlikely that we're coming back - so don't
flood the logs with redundant information.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
This kills one wip remnant from my i830_memory cleanup and the last
remains of the subpicture support.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
In the long long ago, fbOffset was used for DGA. The server now has
only one reference to fbOffset, a leftover setting of it in fbdevhw.
We can safely ignore it now, which is good since we weren't updating
it in other places where the front buffer offset could change.
We know that it's clobbered at each batchbuffer, anyway. And even if
this server isn't running DRI2, it can still be clobbered at batch
start in the KMS world.
The previous code made no sense (multiplying an offset by 4 is
meaningless). It could only have worked with the offset being
fortuitously 0.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
Like with the per context stuff, also drop the now artificial limit
on surfaces. Again, with that gone, a lot of code can be deleted.
Reviewed-by: Carl Worth <cworth@cworth.org>
There's no longer any reason to limit the number of active contexts.
So kill this accounting, too.
With that all gone, per-context state in the ddx is nil, so rip out
all associated code.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
Proper bo management ensures that the cpu doesn't step on buffers
used by the gpu. Drop the now unnecessary synchronization.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
Cache coherency is now fully under the control of gem.
For lack of hw documentation, I had to find out the correct cache
placements by trial and error:
Backward and forward surfaces: I915_GEM_DOMAIN_RENDER
Correlation data: I915_GEM_DOMAIN_SAMPLER
Changing any of them leads to visual corruptions, so I think these
are the correct ones.
Reviewed-by: Carl Worth <cworth@cworth.org>
Now the last user of the fixed buffers provided by the ddx is gone!
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
It works!
v2: Correlation data needs to be in the render cache!
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
I've decided to allocate a new buffer for every render command, to
avoid stalling on the gpu. libdrm bo reuse should take care of
not wasting memory in case the buffer is not busy.
Also always emit the full state; it's not worth complicating
the code over a few stores to wc memory.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
Like with one_time_state_emit, this preps for relocatable bo's.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
This also starts to kill the last remnants of the support for
physical addresses for the indirect state buffers. With gem this
would need kernel support (in the form of a new reloc type in
execbuf2).
This does not change the ABI between ddx and client libIntelXvMC.
I've decided to do this in one swoop when all the buffer rework is
done.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
Seems to be a remnant from i810 XvMC support. last_flip is always 0,
so serves no real purpose anymore. Kill it and the associated code.
With last_flip gone, last_render also lost its purpose. Kill it, too.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
This is in preparation for real relocatable drm_bo's instead
of memory at a fixed address. By switching to the batchbuffer
macros (like i965 xvmc) we can use the nice OUT_RELOC macro.
Also align the code more with coding-style elsewhere, i.e. bitops
instead of bitfield structures. The bitfield structures are
quite a mess to work with the batchbuffer macros, so they were
getting in the way, anyway.
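
To illustrate the style difference only: with bitops a command dword is
built from shifts and masks and can be emitted directly as a plain
32-bit value. The field layout below is made up for the example and
does not correspond to any real i915 command.

```c
#include <stdint.h>

/* Made-up field layout, for illustration only. */
#define CMD_TYPE_SHIFT   29
#define CMD_TYPE_MASK    (0x7u << CMD_TYPE_SHIFT)
#define CMD_OPCODE_SHIFT 16
#define CMD_OPCODE_MASK  (0xffu << CMD_OPCODE_SHIFT)
#define CMD_LENGTH_MASK  0xffffu

/* Pack the three hypothetical fields into one dword. */
static uint32_t pack_cmd(uint32_t type, uint32_t opcode, uint32_t length)
{
    return ((type << CMD_TYPE_SHIFT) & CMD_TYPE_MASK) |
           ((opcode << CMD_OPCODE_SHIFT) & CMD_OPCODE_MASK) |
           (length & CMD_LENGTH_MASK);
}

/* Extract the opcode field again. */
static uint32_t cmd_opcode(uint32_t dword)
{
    return (dword & CMD_OPCODE_MASK) >> CMD_OPCODE_SHIFT;
}
```

Unlike a bitfield struct, the result has a layout that does not depend
on the compiler, and it is already the single value a batchbuffer macro
expects.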
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
WIP code that hasn't changed for over two years is unlikely to
suddenly start progressing. Drop it. After all, git can easily
resurrect it in case it's needed.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
Yes, this breaks binary compat of the struct passed around between
X ddx and the client libXvMC. But we always ship both, so they should
not get out of sync.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Carl Worth <cworth@cworth.org>
Kill the corresponding !bo path in i830_free_memory.
Also kill another remnant of the pre-kms era in the same file, while I
was looking at the code.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Eric Anholt <eric@anholt.net>