[webkit-gtk] guchar * sometimes a utf8, sometimes utf16?

Mon Jun 4 12:49:34 PDT 2018

Hello Michael,

Thank you for your feedback.

In WebKit1 we used to do it this way:
> frame = webkit_web_view_get_main_frame (webview);
> source = webkit_web_frame_get_data_source (frame);
> encoding = webkit_web_data_source_get_encoding (source);

'encoding' would be something like "UTF8", so we could decode the string
using the relevant encoding.

For WebKit2, I cannot find a way to get the encoding, only the raw bytes:
guchar *gu_data = webkit_web_resource_get_data_finish(...);

How did WebKit1's 'webkit_web_data_source_get_encoding()' function retrieve
the encoding, and is there a way to do the same in WebKit2?
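
In the meantime, one cheap way to tell UTF-8 from UTF-16 is to look for a
byte order mark at the start of the buffer. The following is only a minimal
sketch under that assumption: 'sniff_bom' is a hypothetical helper, not a
WebKitGTK function, and it cannot help when the resource carries no BOM
(which is common for UTF-8).

```c
#include <stddef.h>

/* Hypothetical helper (not part of WebKitGTK): guess a Unicode encoding
 * from a leading byte order mark. Returns NULL when no BOM is present,
 * in which case the encoding is still unknown.
 * Caveat: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and would
 * be reported as UTF-16LE by this simple check. */
static const char *
sniff_bom (const unsigned char *data, size_t len)
{
  if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
    return "UTF-8";
  if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
    return "UTF-16LE";
  if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
    return "UTF-16BE";
  return NULL;
}
```

This would be called as sniff_bom (gu_data, length), with the length that
webkit_web_resource_get_data_finish() reports alongside the bytes.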

Thank you.

On Thu, May 31, 2018 at 8:03 PM, Michael Catanzaro <mcatanzaro at igalia.com>
wrote:

> On Thu, May 31, 2018 at 5:05 PM, Leo Ufimtsev <Leonidas at redhat.com> wrote:
>
>> Hello guys,
>>
>> The following function:
>> guchar * webkit_web_resource_get_data_finish(..)
>>
>> Sometimes returns utf8 and sometimes utf16. Is there a way to tell them
>> apart?
>>
>> Thank you.
>>
>
> Hm, good question. I don't know the answer, but here are some thoughts
> anyway:
>
> We use guchar instead of gchar to indicate that it's a byte array, not a
> string, so it's not expected to be UTF-8. In fact, it could be any
> arbitrary encoding, not just UTF-16. I've seen more esoteric encodings
> before, particularly for CJKV websites. Of course, it might not be an HTML
> resource at all, it could be an image or an executable file or anything.
>
> Assuming you know it is an HTML doc, then I think you want to parse the
> charset from the meta tag. Of course, that's a bit difficult because you do
> not know the encoding you should be using to parse it until after you have
> somehow successfully parsed it. I don't know how you would do it, but
> clearly WebKit knows how, somewhere. In Epiphany, our use is limited to
> saving resources on disk, which then get parsed by other applications when
> you open them, which is why we've never needed to deal with this problem.
>
> For a website loaded via HTTP, the encoding could also have been set by an
> HTTP header. There's really nothing you can do in that case, as you don't
> have access to that.
>
> I think Firefox uses an encoding detector. WebKit does not, but it's one
> option. ICU can do this, as can uchardet. Problem is, they are
> probabilistic and do not work well for some important encodings (e.g.
> GB18030). But that might work well enough for your needs.
>
> Michael
>
>
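
The meta-tag idea quoted above could be sketched roughly as follows. This is
a simplified byte scan, not the full HTML encoding-sniffing algorithm and not
what WebKit does internally. 'prescan_charset' and the 1024-byte limit are
assumptions made for illustration. It only works when the markup itself is
ASCII-compatible, so UTF-16 documents would need a BOM check first.

```c
#include <ctype.h>
#include <stddef.h>

#define PRESCAN_LIMIT 1024 /* assumed limit: only inspect the document head */

/* Hypothetical helper: look for "charset = label" in the first bytes of an
 * HTML document and copy the label into 'out' (NUL-terminated).
 * Returns 1 if a label was found, 0 otherwise. */
static int
prescan_charset (const unsigned char *data, size_t len,
                 char *out, size_t out_size)
{
  static const char key[] = "charset";
  const size_t key_len = sizeof key - 1;
  size_t limit = len < PRESCAN_LIMIT ? len : PRESCAN_LIMIT;
  size_t i, j, n;

  if (out_size == 0)
    return 0;

  for (i = 0; i + key_len <= limit; i++)
    {
      /* Case-insensitive match of "charset" at offset i. */
      for (j = 0; j < key_len; j++)
        if (tolower (data[i + j]) != key[j])
          break;
      if (j < key_len)
        continue;

      /* Expect optional whitespace, '=', optional whitespace, optional quote. */
      j = i + key_len;
      while (j < limit && isspace (data[j]))
        j++;
      if (j >= limit || data[j] != '=')
        continue;
      j++;
      while (j < limit && isspace (data[j]))
        j++;
      if (j < limit && (data[j] == '"' || data[j] == '\''))
        j++;

      /* Copy the label up to the next delimiter. */
      n = 0;
      while (j < limit && n + 1 < out_size
             && !isspace (data[j]) && data[j] != '"' && data[j] != '\''
             && data[j] != ';' && data[j] != '>' && data[j] != '/')
        out[n++] = (char) data[j++];
      out[n] = '\0';
      if (n > 0)
        return 1;
    }

  return 0;
}
```

The label it returns is whatever the page declares, which may be wrong or
missing, so a caller would still want a fallback such as a detector.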

-- 
Leo Ufimtsev, Software Engineer, Red Hat