[webkit-dev] Fwd: Using WebKit's markup as facility of Chrome's DOM serializer

Tue Nov 18 00:45:57 PST 2008

Hello everyone,

This is Johnny from the Google Beijing office. I implemented the
DomSerializer for Chrome. As Darin discussed with you earlier, we have
decided to use WebKit's markup code as the basis of Chrome's DOM
serializer. I am very glad to continue working on it with you, the WebKit
developers.

After reading the markup code, I would like to discuss the following
features that we implemented in Chrome's DomSerializer. The current
DomSerializer implementation already uses createMarkup to serialize every
kind of node except Element nodes. Element nodes are handled separately
for the following reasons:

1.  Markup currently cannot fix up URL references so that they point to
    the saved subresources. When we serialize the DOM in order to save a
    web page to local storage, we need to replace the URLs of all saved
    subresources with local URLs.
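
    A minimal sketch of the URL fix-up in feature 1, written against
    std::string and std::map so that it compiles outside WebKit. The names
    SavedResourceMap and fixUpSubresourceURL are hypothetical and are not
    taken from Chrome or WebKit; the sketch only assumes that the saving
    code knows which original URL was written to which local path.

```cpp
#include <map>
#include <string>

// Hypothetical lookup table: original subresource URL -> relative local path.
using SavedResourceMap = std::map<std::string, std::string>;

// Returns the local URL of a saved subresource, or the original URL
// unchanged when no local copy exists for it.
std::string fixUpSubresourceURL(const SavedResourceMap& saved,
                                const std::string& originalURL)
{
    auto it = saved.find(originalURL);
    return it == saved.end() ? originalURL : it->second;
}
```

    URLs without an entry in the table are returned untouched, so the
    caller can decide separately how to resolve them.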

2.  We add the Mark of the Web (MOTW) before the <HTML> tag of the saved
    HTML page. For an explanation of MOTW, see:

http://msdn.microsoft.com/en-us/library/ms537628.aspx#Adding_MOTW_to_HTML_Documents

    It looks like many users install Safari or Chrome but keep MSIE as
    their default browser, and we expect this to be a sizable population,
    at least initially. Since all it takes to protect them against
    file:///-related attacks in MSIE is adding a single line to the saved
    HTML, we think it is worth doing, even if it is not very important.
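
    As an illustration of that single line, here is a small sketch that
    builds the MOTW comment for a given page URL. It assumes the format
    described on the MSDN page above, where the number in parentheses is
    the length of the URL padded to four digits. The function name
    makeMarkOfTheWeb is hypothetical.

```cpp
#include <cstdio>
#include <string>

// Builds the Mark of the Web comment for a page saved from pageURL.
// The value in parentheses is the URL length, zero-padded to four digits.
std::string makeMarkOfTheWeb(const std::string& pageURL)
{
    char length[32];
    std::snprintf(length, sizeof(length), "%04zu", pageURL.size());
    return "<!-- saved from url=(" + std::string(length) + ")" + pageURL + " -->\n";
}
```

    The serializer would write the returned line just before the <HTML>
    open tag.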

3.  Instead of serializing the page to Unicode content, we want to
    serialize it in its current encoding. Another option is to let the
    user choose the desired encoding/character set. We also need to write
    a correct <META> declaration in the <HEAD> section. Currently in
    Chrome, we
    a) use TextEncoding to encode every piece of serialized HTML content
       with the encoding used by the current page (frame);
    b) skip all <META> tags that declare a charset;
    c) append a new <META> tag with the correct charset declaration right
       after the open tag of the HEAD element, which makes sure that the
       parser detects the <META> charset declaration.
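
    Steps (b) and (c) could be sketched roughly as follows. Both helper
    names are hypothetical, and the check in declaresCharset is a
    simplification that only looks at the content attribute of a
    <META http-equiv> tag; it is not how Chrome's DomSerializer or WebKit
    detects charset declarations.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Step (b): true when the content attribute of a <META> tag carries a
// charset declaration, compared case-insensitively.
bool declaresCharset(std::string content)
{
    std::transform(content.begin(), content.end(), content.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return content.find("charset=") != std::string::npos;
}

// Step (c): the <META> tag to write right after the <HEAD> open tag.
std::string makeCharsetMetaTag(const std::string& charset)
{
    return "<META http-equiv=\"Content-Type\" content=\"text/html; charset="
        + charset + "\">";
}
```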

4.  We comment out all <BASE> tags and replace them with new <BASE> tags.

    Current status:
    The usual way of handling the base tag is:
    a) Links that have corresponding locally saved files, such as savable
       CSS and JavaScript files, are rewritten as relative URLs that point
       to the locally saved file. They are not resolved to absolute file
       URLs, because those would become dead links as soon as the saved
       files are moved from one directory to another.
    b) Links that have no corresponding locally saved files, such as links
       in A and AREA tags, are resolved to absolute URLs.
    c) All base tags are commented out when serializing the DOM of the
       page.
    Firefox also handles the base tag this way.

    Problem:
    This approach cannot handle the case where the base tag is written by
    JavaScript. For example, assume the page "www.yahoo.com" uses
    "document.write('<base href="http://www.yahoo.com/"...');" to set up
    the base URL while the page is loading, and that we save
    "www.yahoo.com" as a complete HTML page to "c:\yahoo.htm". When we
    later load the saved page, the JavaScript inserts a base tag
    <base href="http://www.yahoo.com/"...> into the DOM, so all URLs that
    point to locally saved resource files are resolved as
    "http://www.yahoo.com/yahoo_files/...", and none of the saved resource
    files can be loaded. The page also renders badly, because the saved
    subresource files (such as CSS and JavaScript files) and subframe
    files cannot be fetched.
    Firefox and IE-based browsers currently all have this problem.

    The approach we use in Chrome:
    We comment out the old base tag and write a new base tag,
    <base href="." ...>, right after it. WebKit always uses the "href"
    attribute of the latest base tag to set the document's base URL, so
    the newly added base tag helps the engine find the correct base URL
    for loading the locally saved resource files. I also think we need to
    inherit the base target value from the document object when appending
    the new base tag.
    If the original document contains multiple base tags, we comment out
    all of them and append a new base tag after each one, because we do
    not know whether an old base tag is original content or was added by
    JavaScript. If it was added by JavaScript, the script(s) will insert
    the base tag(s) into the DOM again when the saved page is loaded, so
    the newly added base tag(s) can override the incorrect base URL and
    make sure we always load the correct locally saved resource files.
    There is still one problem: if, while the saved page is loading, the
    JavaScript appends its <BASE> tag after the newly added
    <base href="."> tag, the base href will still point to the wrong
    place.
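
    The rewrite itself could look roughly like the sketch below, which
    takes the markup of one original base tag and the base target
    inherited from the document. The function name rewriteBaseTag is
    hypothetical, and the sketch assumes that the original tag markup
    contains no "--" sequence that would end the comment early.

```cpp
#include <string>

// Comments out the original <BASE> tag and appends a replacement that
// points at the directory of the saved page. baseTarget may be empty.
std::string rewriteBaseTag(const std::string& originalTagMarkup,
                           const std::string& baseTarget)
{
    std::string result = "<!--" + originalTagMarkup + "-->";
    result += "<base href=\".\"";
    if (!baseTarget.empty())
        result += " target=\"" + baseTarget + "\"";
    result += ">";
    return result;
}
```

    Calling it once per base tag in the document gives the "one new tag
    after each old tag" behavior described above.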

5.  Instead of generating all content at once, Chrome uses a fixed-size
    chunk to hold the serialized HTML content. Once the chunk is full, we
    process its content and clear the chunk before continuing to
    serialize. This way we can save data incrementally without a big
    buffer that caches the whole page, which could be huge.
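
    A minimal sketch of such a fixed-size chunk, assuming the processing
    step (encoding, writing to disk, and so on) is supplied as a callback.
    The class name ChunkedWriter and its interface are hypothetical and
    are not Chrome's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

class ChunkedWriter {
public:
    using Sink = std::function<void(const std::string&)>;

    // A chunk size of zero is bumped to one so that append() always progresses.
    ChunkedWriter(std::size_t chunkSize, Sink sink)
        : m_chunkSize(chunkSize ? chunkSize : 1), m_sink(std::move(sink))
    {
        m_chunk.reserve(m_chunkSize);
    }

    // Copies markup into the chunk and hands the chunk to the sink
    // every time it becomes full.
    void append(const std::string& markup)
    {
        std::size_t offset = 0;
        while (offset < markup.size()) {
            std::size_t room = m_chunkSize - m_chunk.size();
            std::size_t count = std::min(room, markup.size() - offset);
            m_chunk.append(markup, offset, count);
            offset += count;
            if (m_chunk.size() == m_chunkSize)
                flush();
        }
    }

    // Hands any remaining content to the sink; call once serialization ends.
    void flush()
    {
        if (m_chunk.empty())
            return;
        m_sink(m_chunk);
        m_chunk.clear();
    }

private:
    std::size_t m_chunkSize;
    Sink m_sink;
    std::string m_chunk;
};
```

    Memory use stays bounded by the chunk size no matter how large the
    page is, at the cost of one sink call per chunk.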

To add the above features to markup, I think we can define an abstract
class called MarkupClient:

class MarkupClient {
public:
  virtual ~MarkupClient() { }

  // The first four methods are for features 2, 3 and 4.
  // Before serializing the open tag of an element, give the client a
  // chance to add data in front of it or to skip the tag.
  virtual String preActionBeforeSerializeOpenTag(const Element* element,
                                                 bool* needSkipThisOpenTag) = 0;

  // After serializing the open tag of an element, give the client a
  // chance to add data behind it.
  virtual String postActionAfterSerializeOpenTag(const Element* element) = 0;

  // Before serializing the end tag of an element, give the client a
  // chance to add data in front of it or to skip the tag.
  virtual String preActionBeforeSerializeEndTag(const Element* element,
                                                bool* needSkipThisEndTag) = 0;

  // After serializing the end tag of an element, give the client a
  // chance to add data behind it.
  virtual String postActionAfterSerializeEndTag(const Element* element) = 0;

  // Called with each piece of markup as soon as it is generated. A
  // derived class can collect everything in one big buffer or process
  // the content incrementally. This is for feature 5; text encoding can
  // also be done in this function.
  virtual bool saveMarkupContent(const String& markupContent) = 0;

  // Gets the local URL of a saved subresource. This is for feature 1.
  virtual bool getSavedLocalSubResourceURL(const String& originalSubResourceUrl,
                                           String* savedLocalSubResourceUrl) = 0;
};

Then we change the declaration of createMarkup to:

String createMarkup(const Node* node, EChildrenOnly includeChildren,
                    Vector<Node*>* nodes, MarkupClient* markupClient);

For the existing uses of markup, we have one class derived from
MarkupClient that implements only saveMarkupContent, using a big buffer to
hold all markup data. Its other methods have empty implementations.
For saving a complete HTML page, we can have another class derived from
MarkupClient that implements all methods.
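
To make the proposal more concrete, here is a rough sketch of the first
kind of client together with the order in which a serializer might call
the hooks for one element. It is written against std::string so that it
compiles outside WebKit: String and Element are stand-ins for the WebCore
types, and BufferingMarkupClient and serializeEmptyElement are
hypothetical names, not existing WebKit code.

```cpp
#include <string>

using String = std::string;          // stand-in for WebCore::String
struct Element { String tagName; };  // stand-in for WebCore::Element

class MarkupClient {
public:
    virtual ~MarkupClient() = default;
    virtual String preActionBeforeSerializeOpenTag(const Element*, bool* needSkipThisOpenTag) = 0;
    virtual String postActionAfterSerializeOpenTag(const Element*) = 0;
    virtual String preActionBeforeSerializeEndTag(const Element*, bool* needSkipThisEndTag) = 0;
    virtual String postActionAfterSerializeEndTag(const Element*) = 0;
    virtual bool saveMarkupContent(const String& markupContent) = 0;
    virtual bool getSavedLocalSubResourceURL(const String& originalUrl, String* localUrl) = 0;
};

// Client for the existing callers: every hook is a no-op and all markup
// is collected in one buffer.
class BufferingMarkupClient : public MarkupClient {
public:
    String preActionBeforeSerializeOpenTag(const Element*, bool* skip) override { *skip = false; return String(); }
    String postActionAfterSerializeOpenTag(const Element*) override { return String(); }
    String preActionBeforeSerializeEndTag(const Element*, bool* skip) override { *skip = false; return String(); }
    String postActionAfterSerializeEndTag(const Element*) override { return String(); }
    bool saveMarkupContent(const String& content) override { m_buffer += content; return true; }
    bool getSavedLocalSubResourceURL(const String&, String*) override { return false; }
    const String& buffer() const { return m_buffer; }

private:
    String m_buffer;
};

// Hypothetical driver showing the hook order for an element without
// attributes or children.
void serializeEmptyElement(const Element& element, MarkupClient& client)
{
    bool skip = false;
    String out = client.preActionBeforeSerializeOpenTag(&element, &skip);
    if (!skip)
        out += "<" + element.tagName + ">";
    out += client.postActionAfterSerializeOpenTag(&element);
    client.saveMarkupContent(out);

    skip = false;
    out = client.preActionBeforeSerializeEndTag(&element, &skip);
    if (!skip)
        out += "</" + element.tagName + ">";
    out += client.postActionAfterSerializeEndTag(&element);
    client.saveMarkupContent(out);
}
```

A client for saving complete HTML pages would override the same hooks to
emit the MOTW line, the charset <META> tag and the rewritten base tags.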

Please give me your comments and corrections. I am looking forward to your
reply.
Thanks!

-- 
Best Regards.
Johnny