Re: [webkit-dev] reclaim of number cells
Hi Maciej Thank you for your reply. The algorithm itself is very simple. Think about the following expression: a = b + c + d. The result of (b+c) is just a temporay value, which is not used anymore after this expression is evaluated. From technical viewpoint, a shadow register file tracks those numbers cells, which reference counter is exatly one. If a reference counter increases (caused by instructions like mov, put_by_id, call...), we remove the particular number cell from this shadow register file. However, if a new value is stored in a register, and a number cell is pointed by its shadow register, we can safely reclaim that number cell. The perfromance drawback is that we always perfrom this check after some selected instructions (like add, sub, mul, div, mod, ...) even if the current benchmark never allocates a number cell. Your idea of 64 bit VM registers are interesting. That makes this patch a complete waste. Since most of our systems are 64 bit systems, we already have 64 bit VM registers :) That makes me wondering, how do you plan to store a 64 bit pointer using a 64 bit VM register? Cheers, zoltan
Hi Zoltan,
The performance results are certainly interesting. Would you mind explaining the algorithm a bit more?
Also, do you know why it is slowing down the tests that it does? In general the tests that seem hurt in performance are ones that would use a lot of immediate numbers. I can understand why they would not benefit from reducing allocation of number cells, but why do they get slower? If we understand that, we might be able to come up with a version that has the upsides but not the downsides of your patch.
I should also note, we are considering using 64-bit VM registers (with a special encoding that well let us store a double, a pointer or an int without a separate type tag, basically all non-double types would look like a NaN but not the standard NaN, so the low 48 or so bits are free.) In such a scheme, number cells would be GC-allocated only if the number is stored in an object (or possibly not at all if we decide we can make objects and arrays bigger).
That's probably a few weeks off though, so I'd be interested in whether we can make your patch work.
Regards, Maciej
On Jun 26, 2008, at 2:38 AM, Zoltan Herczeg wrote:
Hi All,
I am Zoltan Herczeg, working for the Dept. of Soft. Eng., Univ. of Szeged. I am currently investigating the JavaScriptCore engine to find some speed-up solutions. During my investigation, I have encountered an interesting issue for the Garbage Collector. Namely, in the case of some sunspider benchmark programs, the GC is frequently called. For example, the runtime of 3d-cube is 121 msec, and it calls the slow Collect() function about 170 times. Thus, the garbage collecting takes the large portion of its runtime (~30%). Such benchmarks cunsume the available Number Cells in a short time, so the GC has a lot of work to reclaim these cells. My idea is to have a simple algorithm, which can reclaim many of these cells without the help of the GC. I have attached an experimental patch, and the performance-results as well. As you can see, it is a rather selective patch. Some programs benefit from it (25% performance improve for 3d-cube), while some do not (like controlflow-recursive, which does not allocate any number cells at all). Perhaps this approach could be a custom-speedup, which should be turned on, when a treshold value is reached. (Number of collects() in a given time is greater than a pre-defined value.) However, the mainloop (Machine:: PrivateExecute) should be duplicated in this case, which is not an easy task because of the computed gotos. Do you have any other idea to make this patch useful?
Regards, Zoltan<gc- improve -34737 .patch><compare.txt>_______________________________________________ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
On Jun 26, 2008, at 5:20 AM, Zoltan Herczeg wrote:
Hi Maciej
Thank you for your reply. The algorithm itself is very simple. Think about the following expression: a = b + c + d. The result of (b+c) is just a temporay value, which is not used anymore after this expression is evaluated. From technical viewpoint, a shadow register file tracks those numbers cells, which reference counter is exatly one. If a reference counter increases (caused by instructions like mov, put_by_id, call...), we remove the particular number cell from this shadow register file. However, if a new value is stored in a register, and a number cell is pointed by its shadow register, we can safely reclaim that number cell. The perfromance drawback is that we always perfrom this check after some selected instructions (like add, sub, mul, div, mod, ...) even if the current benchmark never allocates a number cell.
Is it possible to skip the check in the fast path for immediate numbers that these instructions have? That would probably remove the drawback.
Your idea of 64 bit VM registers are interesting. That makes this patch a complete waste. Since most of our systems are 64 bit systems, we already have 64 bit VM registers :) That makes me wondering, how do you plan to store a 64 bit pointer using a 64 bit VM register?
The only pointers that have to be stored will be into the JavaScript heap, and we can use VM allocation tricks to ensure that it is limited to 48 bits or so of real address information in the 64-bit case. (Collector memory already uses low-level mechanisms to allocate directly from the VM subsystem.) - Maciej
Cheers, zoltan
Hi Zoltan,
The performance results are certainly interesting. Would you mind explaining the algorithm a bit more?
Also, do you know why it is slowing down the tests that it does? In general the tests that seem hurt in performance are ones that would use a lot of immediate numbers. I can understand why they would not benefit from reducing allocation of number cells, but why do they get slower? If we understand that, we might be able to come up with a version that has the upsides but not the downsides of your patch.
I should also note, we are considering using 64-bit VM registers (with a special encoding that well let us store a double, a pointer or an int without a separate type tag, basically all non-double types would look like a NaN but not the standard NaN, so the low 48 or so bits are free.) In such a scheme, number cells would be GC-allocated only if the number is stored in an object (or possibly not at all if we decide we can make objects and arrays bigger).
That's probably a few weeks off though, so I'd be interested in whether we can make your patch work.
Regards, Maciej
On Jun 26, 2008, at 2:38 AM, Zoltan Herczeg wrote:
Hi All,
I am Zoltan Herczeg, working for the Dept. of Soft. Eng., Univ. of Szeged. I am currently investigating the JavaScriptCore engine to find some speed-up solutions. During my investigation, I have encountered an interesting issue for the Garbage Collector. Namely, in the case of some sunspider benchmark programs, the GC is frequently called. For example, the runtime of 3d-cube is 121 msec, and it calls the slow Collect() function about 170 times. Thus, the garbage collecting takes the large portion of its runtime (~30%). Such benchmarks cunsume the available Number Cells in a short time, so the GC has a lot of work to reclaim these cells. My idea is to have a simple algorithm, which can reclaim many of these cells without the help of the GC. I have attached an experimental patch, and the performance-results as well. As you can see, it is a rather selective patch. Some programs benefit from it (25% performance improve for 3d-cube), while some do not (like controlflow-recursive, which does not allocate any number cells at all). Perhaps this approach could be a custom-speedup, which should be turned on, when a treshold value is reached. (Number of collects() in a given time is greater than a pre-defined value.) However, the mainloop (Machine:: PrivateExecute) should be duplicated in this case, which is not an easy task because of the computed gotos. Do you have any other idea to make this patch useful?
Regards, Zoltan<gc- improve -34737 .patch><compare.txt>_______________________________________________ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
_______________________________________________ webkit-dev mailing list webkit-dev@lists.webkit.org http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
Hi Maciej, It seems you are right, my patch can afford some tweaks. As you can see, the gain is doubled compared to the original one. By the way, I was thinking another issue. In the current build, every JS object stores its own string lookup table. Is it possible to merge these string tables into a global one? I hope it will decrease both the memory consumption and the runtime, since we only need pointer comparison instead of string comparison. Regards, Zoltan
On Jun 26, 2008, at 5:20 AM, Zoltan Herczeg wrote:
Hi Maciej
Thank you for your reply. The algorithm itself is very simple. Think about the following expression: a = b + c + d. The result of (b+c) is just a temporay value, which is not used anymore after this expression is evaluated. From technical viewpoint, a shadow register file tracks those numbers cells, which reference counter is exatly one. If a reference counter increases (caused by instructions like mov, put_by_id, call...), we remove the particular number cell from this shadow register file. However, if a new value is stored in a register, and a number cell is pointed by its shadow register, we can safely reclaim that number cell. The perfromance drawback is that we always perfrom this check after some selected instructions (like add, sub, mul, div, mod, ...) even if the current benchmark never allocates a number cell.
Is it possible to skip the check in the fast path for immediate numbers that these instructions have? That would probably remove the drawback.
Your idea of 64 bit VM registers are interesting. That makes this patch a complete waste. Since most of our systems are 64 bit systems, we already have 64 bit VM registers :) That makes me wondering, how do you plan to store a 64 bit pointer using a 64 bit VM register?
The only pointers that have to be stored will be into the JavaScript heap, and we can use VM allocation tricks to ensure that it is limited to 48 bits or so of real address information in the 64-bit case. (Collector memory already uses low-level mechanisms to allocate directly from the VM subsystem.)
- Maciej
participants (2)
-
Maciej Stachowiak
-
Zoltan Herczeg