On Jun 13, 2006, at 8:59 AM, Mike Reed wrote:
... <input type="checkbox" name="box" checked="checked" />Test <input ...> ...
When I draw this page, I see a box at the end of "Test". "Test" is coming into Font::drawText() as a 5 character string, with a CR (or LF, don't remember which) at the end. In my font, that draws as a box.
Is it correct that the parser didn't strip that, or convert it into a space?
Yes. The parser must not strip it or convert it to a space; the DOM must contain the newline.
If so, is my port expected to strip these sorts of characters each time I measure or draw (hurting performance)?
Yes. Having the text rendering machinery handle these characters specially makes things faster on platforms where we can do that efficiently (which now includes both Macintosh and Windows on TOT, since there's shared high-speed text rendering code). Allan outlined a way we could change bidi.cpp to implement this rule at a higher level. If we can do that without hurting performance on Macintosh and Windows, we could take the code out of the platform layer. Hyatt's the one who's been working on this recently.
If I had complete control over all my fonts, I could whack their cmap tables to ensure that all control characters mapped to zero-width spaces, but I don't have that luxury.
There may be other ways to do that quickly in the text rendering layer. For example, it's probably quite quick to scan a string and check whether any characters are in this range. In the case where they are, you have to allocate a buffer and copy the string, but I think that's relatively rare. I'd also be comfortable taking a patch that changes it so that the bidi.cpp level takes care of this and the code in the platform directory doesn't have to handle it any more. Since this is a highly performance-sensitive part of the code, and the way we do this now is very fast, we have to make sure we do performance measurements if we change how this works.
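To make the scan-then-copy idea concrete, here is a rough sketch of what a port might do. It is not WebKit code: needsSpecialHandling() and textForDrawing() are hypothetical names, std::u16string stands in for whatever string type the port uses, and dropping the invisible characters (rather than, say, substituting a zero-width space) is just one possible choice. The ranges come from the rule quoted in this message.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true for code units that cannot be handed to the
// font as-is. A plain space (U+0020) is left alone, since it already
// draws as a space.
static bool needsSpecialHandling(char16_t c)
{
    return c < 0x0020 || (c >= 0x007F && c <= 0x00A0);
}

// Returns the string to measure or draw. In the common case this is the
// input itself, with no allocation. Only when a special character is found
// is the text copied into the caller-supplied scratch buffer.
static const std::u16string& textForDrawing(const std::u16string& text,
                                            std::u16string& scratch)
{
    auto first = std::find_if(text.begin(), text.end(), needsSpecialHandling);
    if (first == text.end())
        return text; // fast path: nothing to fix up

    // Slow path: keep the clean prefix, then rewrite the remainder.
    scratch.assign(text.begin(), first);
    for (auto it = first; it != text.end(); ++it) {
        char16_t c = *it;
        if (c == 0x000A || c == 0x0009 || c == 0x00A0)
            scratch.push_back(u' ');  // \n, \t, non-breaking space
        else if (!needsSpecialHandling(c))
            scratch.push_back(c);     // ordinary character
        // any other control character is dropped
    }
    return scratch;
}
```

Note that dropping characters changes the string length, so a port that maps offsets back into the original text (for selection or hit testing) would need to substitute a zero-width character instead.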
If I am required to handle these control characters, is there a list of exactly which the parser will pass through?
Here's the rule, taken from the code in GlyphMap.cpp (now cross-platform on TOT, formerly Macintosh-specific code) that implements the rule for the fast code path:

Control characters (U+0000 - U+0020, U+007F - U+00A0) must not render at all.
\n (U+000A), \t (U+0009), and non-breaking space (U+00A0) must render as a space.

-- Darin
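
For porters who want the rule in code form, one way to read it is as a per-character classification. This is only a sketch of the rule as stated above, not the code from GlyphMap.cpp; the enum and function names are made up for illustration. It assumes the "render as a space" cases take priority over the "do not render" ranges, and that an ordinary space (U+0020) is simply drawn normally.

```cpp
// Hypothetical classification of a UTF-16 code unit under the rule above.
enum class GlyphTreatment {
    Normal,        // look the character up in the font as usual
    RenderAsSpace, // draw with the metrics of a space
    Invisible      // draw nothing and take up no width
};

static GlyphTreatment treatmentFor(char16_t c)
{
    // \n, \t and non-breaking space render as a space. These are checked
    // first because they fall inside the control-character ranges below.
    if (c == 0x000A || c == 0x0009 || c == 0x00A0)
        return GlyphTreatment::RenderAsSpace;

    // Remaining control characters must not render at all.
    if (c < 0x0020 || (c >= 0x007F && c < 0x00A0))
        return GlyphTreatment::Invisible;

    return GlyphTreatment::Normal;
}
```

Under this reading, the CR or LF at the end of "Test" in the original question never reaches the font as a box: LF becomes a space and CR is not drawn.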