[webkit-gtk] guchar * sometimes a utf8, sometimes utf16?
Michael Catanzaro
mcatanzaro at igalia.com
Thu May 31 17:03:33 PDT 2018
On Thu, May 31, 2018 at 5:05 PM, Leo Ufimtsev <Leonidas at redhat.com>
wrote:
> Hello guys,
>
> The following function:
> guchar * webkit_web_resource_get_data_finish(..)
>
> Sometimes returns utf8 and sometimes utf16. Is there a way to tell
> them apart?
>
> Thank you.
Hm, good question. I don't know the answer, but here are some thoughts
anyway:
We use guchar instead of gchar to indicate that it's a byte array, not
a string, so it's not expected to be UTF-8. In fact, it could be any
arbitrary encoding, not just UTF-16. I've seen more esoteric encodings
before, particularly for CJKV websites. Of course, it might not be an
HTML resource at all; it could be an image, an executable file, or
anything else.
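Since the bytes carry no label, one cheap first check is whether they
start with a byte order mark. The sketch below is only an illustration:
sniff_bom and the bom_encoding enum are hypothetical names, not WebKit
API, and many documents have no BOM at all, in which case this tells
you nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; none of these names come from WebKit. */
enum bom_encoding { BOM_NONE, BOM_UTF8, BOM_UTF16BE, BOM_UTF16LE };

/* Report which byte order mark, if any, the buffer starts with.
 * Note: a UTF-32LE BOM (FF FE 00 00) also begins with FF FE, so this
 * sketch would report it as UTF-16LE. */
static enum bom_encoding sniff_bom(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);

    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BOM_UTF16BE;
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BOM_UTF16LE;
    return BOM_NONE;
}
```

You would pass it the guchar pointer and length you got back from
webkit_web_resource_get_data_finish(), and fall back to something else
on BOM_NONE.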
Assuming you know it is an HTML doc, then I think you want to parse the
charset from the meta tag. Of course, that's a bit difficult because
you do not know the encoding you should be using to parse it until
after you have somehow successfully parsed it. I don't know how you
would do it, but clearly WebKit knows how, somewhere. In Epiphany, our
use is limited to saving resources on disk, which then get parsed by
other applications when you open them, which is why we've never needed
to deal with this problem.
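One way around the chicken-and-egg problem, assuming the document is in
an ASCII-compatible encoding, is to scan the first bytes as plain ASCII
and look for a charset declaration; the HTML standard describes a
prescan along these lines. The sketch below is a much simplified
version, not what WebKit does: prescan_charset and PRESCAN_LIMIT are
hypothetical names, it does not parse tags or attributes properly, and
it will not find anything in UTF-16 data, where every other byte is
typically zero.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Only look at the start of the document (assumed limit). */
#define PRESCAN_LIMIT 1024

/* Hypothetical helper: find "charset = value" in the first bytes and
 * copy the value into out (always NUL-terminated). Returns 1 if a
 * label was found, 0 otherwise. */
static int prescan_charset(const unsigned char *data, size_t len,
                           char *out, size_t out_size)
{
    static const char key[] = "charset";
    const size_t key_len = strlen(key);
    size_t i, j, n;

    assert(data != NULL || len == 0);
    assert(out != NULL && out_size > 0);

    if (len > PRESCAN_LIMIT)
        len = PRESCAN_LIMIT;

    for (i = 0; i + key_len <= len; i++) {
        /* Case-insensitive match of "charset" at offset i. */
        for (j = 0; j < key_len; j++)
            if (tolower(data[i + j]) != key[j])
                break;
        if (j < key_len)
            continue;

        /* Expect optional whitespace, '=', optional whitespace. */
        j = i + key_len;
        while (j < len && isspace(data[j]))
            j++;
        if (j >= len || data[j] != '=')
            continue;
        j++;
        while (j < len && isspace(data[j]))
            j++;

        /* Skip an opening quote, if any. */
        if (j < len && (data[j] == '"' || data[j] == '\''))
            j++;

        /* Copy the label: letters, digits and a few punctuation marks. */
        n = 0;
        while (j < len && n + 1 < out_size &&
               (isalnum(data[j]) || data[j] == '-' || data[j] == '_' ||
                data[j] == '.' || data[j] == ':'))
            out[n++] = (char)data[j++];

        if (n > 0) {
            out[n] = '\0';
            return 1;
        }
    }

    out[0] = '\0';
    return 0;
}
```

The label you get back is whatever the page claims, which may be wrong
or an alias, so you would still want to validate it before handing it
to a converter.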
For a website loaded via HTTP, the encoding could also have been set by
an HTTP header. There's really nothing you can do in that case, as you
don't have access to that.
I think Firefox uses an encoding detector. WebKit does not, but using
one yourself is an option. ICU can do this, as can uchardet. The
problem is that they are probabilistic and do not work well for some
important encodings (e.g. GB18030). But that might work well enough for
your needs.
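If the only candidates really are UTF-8 and UTF-16, a much cruder
heuristic than a full detector may be enough. The sketch below is just
that, a guess: guess_utf8_or_utf16 and the enum are hypothetical names,
and it assumes mostly ASCII-range text, so that UTF-16 data contains
many zero bytes at odd offsets (little-endian) or even offsets
(big-endian), while UTF-8 text contains none and passes strict
validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the heuristic below. */
enum guess { GUESS_UNKNOWN, GUESS_UTF8, GUESS_UTF16LE, GUESS_UTF16BE };

/* Strict UTF-8 check: rejects bad lead or continuation bytes,
 * truncated sequences, overlong forms, surrogates and values above
 * U+10FFFF. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;

        if (len - i - 1 < need)
            return 0;                       /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}

/* Heuristic only: zero bytes suggest UTF-16, and their position
 * suggests the byte order; otherwise fall back to UTF-8 validation. */
static enum guess guess_utf8_or_utf16(const unsigned char *data, size_t len)
{
    size_t i, even_zeros = 0, odd_zeros = 0;

    assert(data != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (data[i] == 0) {
            if (i % 2 == 0) even_zeros++;
            else odd_zeros++;
        }
    }
    if (odd_zeros > even_zeros)
        return GUESS_UTF16LE;
    if (even_zeros > odd_zeros)
        return GUESS_UTF16BE;
    if (even_zeros > 0)
        return GUESS_UNKNOWN;               /* zeros, but no clear pattern */
    return is_valid_utf8(data, len) ? GUESS_UTF8 : GUESS_UNKNOWN;
}
```

Pure ASCII is reported as UTF-8, which is harmless. This will misfire
on binary resources and on UTF-16 text that is mostly non-Latin, so
treat GUESS_UNKNOWN and any surprising answer as a cue to use a real
detector.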
Michael