[webkit-gtk] guchar * sometimes a utf8, sometimes utf16?

Thu May 31 17:03:33 PDT 2018

On Thu, May 31, 2018 at 5:05 PM, Leo Ufimtsev <Leonidas at redhat.com> 
wrote:
> Hello guys,
> 
> The following function:
> guchar * webkit_web_resource_get_data_finish(..)
> 
> Sometimes returns utf8 and sometimes utf16. Is there a way to tell 
> them apart?
> 
> Thank you.

Hm, good question. I don't know the answer, but here are some thoughts 
anyway:

We use guchar instead of gchar to indicate that it's a byte array, not 
a string, so it's not expected to be UTF-8. In fact, it could be any 
arbitrary encoding, not just UTF-16. I've seen more esoteric encodings 
before, particularly for CJKV websites. Of course, it might not be an 
HTML resource at all; it could be an image or an executable file or 
anything.
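
For the narrower UTF-8 versus UTF-16 question, one cheap first check is 
to look for a byte order mark at the start of the buffer. This is only a 
sketch: sniff_bom is a name I made up, not a WebKit or GLib function, and 
plenty of documents carry no BOM at all, in which case it tells you 
nothing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the encoding implied by a leading byte
 * order mark, or NULL if the buffer does not start with one.
 * Only UTF-8 and UTF-16 are checked; note that a UTF-32LE BOM
 * (FF FE 00 00) would be misreported as UTF-16LE here. */
static const char *
sniff_bom (const unsigned char *data, size_t length)
{
  if (length >= 3 && memcmp (data, "\xEF\xBB\xBF", 3) == 0)
    return "UTF-8";
  if (length >= 2 && memcmp (data, "\xFF\xFE", 2) == 0)
    return "UTF-16LE";
  if (length >= 2 && memcmp (data, "\xFE\xFF", 2) == 0)
    return "UTF-16BE";
  return NULL;
}
```

You would pass it the guchar pointer and length you got back from 
webkit_web_resource_get_data_finish(), and fall back to something else 
when it returns NULL.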

Assuming you know it is an HTML doc, then I think you want to parse the 
charset from the meta tag. Of course, that's a bit difficult because 
you do not know the encoding you should be using to parse it until 
after you have somehow successfully parsed it. I don't know how you 
would do it, but clearly WebKit knows how, somewhere. In Epiphany, our 
use is limited to saving resources on disk, which then get parsed by 
other applications when you open them, which is why we've never needed 
to deal with this problem.
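
One way around the chicken-and-egg problem, loosely modeled on the 
byte-level prescan described in the HTML standard, is to search the 
first chunk of raw bytes for an ASCII "charset=" label before decoding 
anything. The sketch below is my own simplification, not what WebKit 
does: prescan_charset and matches_at are made-up names, the 1024-byte 
window and the accepted label characters are assumptions, and it can 
only work for ASCII-compatible encodings (it will find nothing in a 
UTF-16 document).

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive ASCII comparison of n bytes against a lowercase key. */
static int
matches_at (const unsigned char *p, const char *key, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (tolower (p[i]) != key[i])
      return 0;
  return 1;
}

/* Hypothetical helper: copies the first charset=... value found in the
 * first 1024 bytes into out (NUL-terminated, truncated to out_size).
 * Returns 1 if a label was found, 0 otherwise. */
static int
prescan_charset (const unsigned char *data, size_t length,
                 char *out, size_t out_size)
{
  const size_t key_len = 7; /* length of "charset" */
  size_t limit = length < 1024 ? length : 1024;

  if (out_size == 0)
    return 0;

  for (size_t i = 0; i + key_len <= limit; i++)
    {
      if (!matches_at (data + i, "charset", key_len))
        continue;

      /* Skip optional whitespace, require '=', skip whitespace again. */
      size_t j = i + key_len;
      while (j < limit && isspace (data[j]))
        j++;
      if (j >= limit || data[j] != '=')
        continue;
      j++;
      while (j < limit && isspace (data[j]))
        j++;

      /* Skip an optional opening quote. */
      if (j < limit && (data[j] == '"' || data[j] == '\''))
        j++;

      /* Copy the label; only letters, digits, '-' and '_' are accepted. */
      size_t n = 0;
      while (j < limit && n + 1 < out_size
             && (isalnum (data[j]) || data[j] == '-' || data[j] == '_'))
        out[n++] = (char) data[j++];
      out[n] = '\0';
      if (n > 0)
        return 1;
    }
  return 0;
}
```

This handles both <meta charset="..."> and the older http-equiv form, 
since it only looks for the "charset=" part. The label it returns is 
whatever the page claims, which may of course be wrong.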

For a website loaded via HTTP, the encoding could also have been set by 
an HTTP header. There's really nothing you can do in that case, as you 
don't have access to that.

I think Firefox uses an encoding detector. WebKit does not, but it's 
one option. ICU can do this, as can uchardet. Problem is, they are 
probabilistic and do not work well for some important encodings (e.g. 
GB18030). But that might work well enough for your needs.
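
If all you need is to tell UTF-16 apart from UTF-8 for markup-heavy 
content, a much cruder heuristic than a real detector might do: HTML is 
mostly ASCII, and ASCII characters in UTF-16 have a zero byte in every 
code unit. To be clear, this is not how ICU or uchardet work; it is a 
stand-in I am sketching here, guess_utf16_by_nuls is a made-up name, and 
the thresholds are arbitrary assumptions.

```c
#include <stddef.h>

/* Hypothetical heuristic: guess UTF-16 byte order from where NUL bytes
 * fall in text that is assumed to be mostly ASCII. Returns "UTF-16LE",
 * "UTF-16BE", or NULL when the data does not look like either. */
static const char *
guess_utf16_by_nuls (const unsigned char *data, size_t length)
{
  size_t even_nuls = 0, odd_nuls = 0;
  size_t pairs = length / 2;

  if (pairs == 0)
    return NULL;

  /* Count NUL bytes separately at even and odd offsets. */
  for (size_t i = 0; i + 1 < length; i += 2)
    {
      if (data[i] == 0)
        even_nuls++;
      if (data[i + 1] == 0)
        odd_nuls++;
    }

  /* Arbitrary thresholds: over half the code units have a NUL on one
   * side, and under a tenth have a NUL on the other side. */
  if (odd_nuls * 2 > pairs && even_nuls * 10 < pairs)
    return "UTF-16LE"; /* low byte first, so the NUL comes second */
  if (even_nuls * 2 > pairs && odd_nuls * 10 < pairs)
    return "UTF-16BE"; /* high byte first, so the NUL comes first */
  return NULL;
}
```

A NULL result does not mean the data is UTF-8, only that it does not 
look like ASCII-dominated UTF-16. It will fail on UTF-16 text that is 
mostly non-Latin, which is exactly the CJKV case mentioned earlier.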

Michael