[Webkit-unassigned] [Bug 90375] Parallel image decoders

Wed Aug 1 01:32:15 PDT 2012

https://bugs.webkit.org/show_bug.cgi?id=90375

--- Comment #49 from Huang Dongsung <luxtella at company100.net>  2012-08-01 01:32:10 PST ---
UPDATE! Here are the test results.

We added two kinds of tests.
1. Response times of interactive web pages.
2. First painting and response time on a 2 core embedded device.

Previously, we measured the first painting time on a high-end PC (Xeon X5650).
This time we wanted to answer two questions.
1. What advantages do parallel image decoders offer in addition to a shorter first painting time?
2. Do parallel image decoders perform well on embedded devices?

See the results in the following link.
https://docs.google.com/spreadsheet/pub?key=0Ar2smwimcenMdGpnTEw2clZjSTNkbXNFNFM5dkYyRGc&output=html

@ Summary of the results
1. The response times of interactive web pages improved by approximately 20%.
2. The embedded device shows a performance improvement similar to that of the high-end PC.

@ Test Environment
1. High-end PC: Intel(R) Xeon(R) CPU X5650 at 2.67GHz × 6
2. Embedded device: Pandaboard ARM Cortex-A9 at 1.2GHz × 2

We used WebKit1 compiled with Qt 4.8.1.

@ Details of each test

eBay Scrolling Test: Gmarket is South Korea's eBay. We measured the FPS while scrolling the page, which includes many pretty girls' pictures.
High-end PC: Measured 99 FPS in the threaded case. The FPS limit in Qt WebKit1 is 100 FPS. I think the threaded case might exceed 99 FPS in other ports.
Embedded device: 21% performance improvement.

35 Images: There are 35 large images scaled down to fit on the screen.
High-end PC: The first painting time is four times faster when MemoryCache cached raw data.
Embedded device: The first painting time is 20% faster when MemoryCache cached raw data.

DOMContentLoaded: There are 3 large images that are scaled down to fit on one page. This test comes from IE Test Drive.
High-end PC: The first painting time is 56% faster when MemoryCache cached raw data.
Embedded device: The first painting time is 20% faster when MemoryCache cached raw data.

RomainGuy: Amateur photographer's blog.
High-end PC: The first painting time is 17% faster when MemoryCache cached raw data.
Embedded device: It makes very little difference.

Apple: iMac's introduction page
High-end PC: The first painting time is 82% faster when MemoryCache cached raw data.

Tumblr: There are many medium-sized images in the blog.
High-end PC: The first painting time is 7.5% faster when MemoryCache cached raw data.

@ The interpretation of the test results
1. Response times of interactive pages are improved. Image decoding runs off the main thread, so the main thread can concentrate on animation or scrolling.
2. 35 Images and DOMContentLoaded show a large improvement in the first painting time, because decoding several large images takes advantage of multiple CPU cores.
On the other hand, the RomainGuy, Apple and Tumblr results need a closer look. Although the three sites are very similar, only the Apple case shows an exceptionally high performance improvement.
In the Apple case, the main thread does a lot of work after requesting image decoding from the parallel image decoders. Web Inspector shows that CSS styling, layout and rendering are performed consistently. However, in the RomainGuy and Tumblr cases, the request for image decoding comes near the end of the page's loading and painting. After the main thread requests image decoding, it usually just waits for decoding to complete, with no other heavy jobs running. I removed JavaScript code in order to prevent resources from loading dynamically, which could exacerbate the situation.
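The difference between the Apple case and the RomainGuy/Tumblr cases can be illustrated with a rough cost model. This is only a sketch under stated assumptions, not a measurement and not WebKit code: LoadCosts and all three functions are hypothetical names, the thread communication overhead is assumed constant, and the multi-core speedup from decoding several images at once is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical cost model; all times are in milliseconds.
struct LoadCosts {
    double mainThreadWorkMs; // Styling, layout and rendering left after the decode request.
    double decodeMs;         // Time needed to decode the requested images.
    double overheadMs;       // Thread communication overhead (assumed constant).
};

// Original case: the main thread decodes the images and does its other work in sequence.
double serialTimeMs(const LoadCosts& costs)
{
    return costs.mainThreadWorkMs + costs.decodeMs;
}

// Threaded case: decoding overlaps with main-thread work, so the longer of the two dominates.
double threadedTimeMs(const LoadCosts& costs)
{
    assert(costs.mainThreadWorkMs >= 0 && costs.decodeMs >= 0 && costs.overheadMs >= 0);
    return std::max(costs.mainThreadWorkMs, costs.decodeMs) + costs.overheadMs;
}

// Time saved by threading; zero or negative when the main thread only waits for decoding.
double timeSavedMs(const LoadCosts& costs)
{
    return serialTimeMs(costs) - threadedTimeMs(costs);
}
```

In this model, a page with plenty of remaining main-thread work saves roughly the shorter of the two costs, while a page whose main thread only waits saves nothing and still pays the overhead.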
Currently, the trigger conditions for parallel image decoders are image size, the number of CPU cores and image format. In the future, we can add the workload of the main thread to the trigger conditions to reduce unnecessary thread communication overhead.
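The three current trigger conditions could be combined roughly as follows. This is a minimal sketch, not the code in the patch: DecodeRequest, shouldDecodeInParallel, isSupportedFormat, the threshold values and the list of formats are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical thresholds; the actual values are not given in this comment.
const std::size_t minPixelsForParallelDecode = 512 * 512;
const unsigned minCoresForParallelDecode = 2;

// Hypothetical description of one decode request.
struct DecodeRequest {
    std::size_t width;
    std::size_t height;
    std::string format; // For example "jpeg" or "png".
};

// Assumed set of formats that are worth decoding on another thread.
bool isSupportedFormat(const std::string& format)
{
    return format == "jpeg" || format == "png";
}

// Returns true only when the core count, image format and image size all qualify.
bool shouldDecodeInParallel(const DecodeRequest& request, unsigned cpuCoreCount)
{
    assert(cpuCoreCount >= 1);
    if (cpuCoreCount < minCoresForParallelDecode)
        return false;
    if (!isSupportedFormat(request.format))
        return false;
    return request.width * request.height >= minPixelsForParallelDecode;
}
```

A main-thread workload condition, as proposed above, would be one more early return in shouldDecodeInParallel.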

@ Settings for testing
1. All test data was stored locally in order to avoid the effects of network latency. I removed most of the JavaScript code in order to prevent resources from loading dynamically.
2. The difference in full loading time between the threaded and the original reference cases is negligible. This is because network latency is the dominant factor, so the main thread often waits until sub-resource loading finishes. We kept most CachedImages in MemoryCache for a more precise measurement of image decoding time.

Code stub for #2.
settings()->setUsesPageCache(false);
memoryCache()->setDeadDecodedDataDeletionInterval(0.01);
memoryCache()->setCapacities(0, 256 * 1024 * 1024, 256 * 1024 * 1024); // 256MB each

@ Details of the embedded test device
- Ubuntu 12.04 on Pandaboard. http://pandaboard.org/
- Xorg occupies about 35% of the CPU, so parallel image decoders cannot fully use both cores.
- Qt is configured as in Bug 84321.
- printf from a shared library does not output anything on the terminal, so I compiled WebKit as static.
- The static build caused crashes on sites with external JavaScript files, so I tested only the sites that do not crash.
