[webkit-dev] size_t vs unsigned in WTF::Vector API ?

Filip Pizlo fpizlo at apple.com
Thu Nov 20 11:10:37 PST 2014


I think I'm starting to see what you're saying, so let me restate it my way to be sure:

Code that deals with various loaded blobs must be resilient against them getting larger than a 32-bit size can handle, and this is the subject of ongoing development effort.  Such code may or may not use a Vector.  It definitely must use size_t for all of its sizes.  Vector using unsigned in its API then causes the following problems:

If the code uses a Vector for the large data: in addition to not being able to actually hold all of the data, there will be a risk of truncation from size_t to unsigned on the client side of the Vector, which is much harder to prevent than a truncation inside Vector itself.  For such code the ideal would be a Vector that measures itself using size_t, both in its internal data structure and in its API.

If the code does not use Vector for that large data but uses it nonetheless for other things: nothing will obviously go wrong, but you end up in sketchy territory: you will have code that is juggling unsigned values for the Vector and size_t values for the large data.  Any arithmetic between the two is suspect.  Consider a case where a size or offset within the large data feeds into arithmetic that flows into a Vector index.  If Vector's indexing operations use unsigned, you will get truncation.
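To make both cases concrete, here is a minimal sketch (not actual WebKit code; the sizes are made up) of the client-side truncation that unsigned sizes invite once the surrounding code measures its data in size_t:

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        size_t resourceSize = 5ull * 1024 * 1024 * 1024; // a 5 GB blob; fits a size_t on 64-bit

        // If the vector's size/capacity API traffics in unsigned, the
        // conversion happens on the caller's side and silently truncates:
        unsigned vectorSize = resourceSize; // 5 GB becomes 1 GB (mod 2^32), no error

        // Mixed arithmetic is just as suspect: an offset into the large
        // data flows into an unsigned index and wraps.
        size_t offset = 4ull * 1024 * 1024 * 1024; // 4 GB into the blob
        unsigned index = offset + 16;              // wraps around to 16

        std::printf("%u %u\n", vectorSize, index); // prints the truncated values
        return 0;
    }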

These problems fall into two categories:

1) Actually not being able to store the large data, and needing to fail gracefully when that happens.

2) Truncation leading to some kind of undesirable behavior.

Someone mentioned a compiler warning that can tell us about (2).  Apparently we cannot turn that warning on right now.  The hope is that if we convert everything to unsigned, we won't have the warnings in trunk, and then we can turn the warning on and catch such bugs.  This seems very aspirational.

Also, it means that everyone hacking on code that mixes "sizes of files" with "sizes of vectors" will have to scratch their heads about which type to use, and then if they do it wrong, they will get some warnings.  Of course, the warnings can easily be silenced with a cast.  That's super dangerous.  In my experience, if the easiest way to make the compiler STFU is by doing the wrong thing (a static_cast, in this case), then people will do the wrong thing quite frequently.
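To illustrate what that path of least resistance looks like in practice (a hypothetical helper and a hypothetical container, not actual WebKit APIs):

    #include <cstdint>
    #include <cstddef>

    // Hypothetical stand-in for a container whose append() takes unsigned,
    // mirroring the Vector API under discussion.
    struct ByteBuffer {
        void append(const uint8_t*, unsigned) { }
    };

    void appendResourceData(ByteBuffer& buffer, const uint8_t* data, size_t length)
    {
        // The 64-to-32 warning fires on the implicit conversion, so the
        // quickest fix is a cast.  It compiles cleanly and silently drops
        // the high bits of `length` for anything at or above 4 GB.
        buffer.append(data, static_cast<unsigned>(length));
    }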

(A particularly funny example of this is checked exceptions in Java.  If you call a method declared to throw a checked exception, you must either catch it or declare that you throw it, too.  The easiest way to silence the error is to catch it with an empty catch block that ignores the exception, since that requires the fewest keystrokes.  Someone once did a study of large Java codebases and found such empty catch blocks to be a chronic problem - people did it a lot, and they were responsible for bad bugs.  The lesson: a type error is only a good thing if the path of least resistance to making the compiler STFU involves actually improving the code.)

An alternate approach is to agree to use size_t for sizes.  This is an easy-to-explain rule, and it should be easy for reviewers to enforce.  It would also make that compiler warning more meaningful: in a world where we are mixing size_t and unsigned, we will *have* to have some casts between the two, but in a world where we use size_t uniformly, any such cast is an immediate red flag.
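For comparison, the same hypothetical helper under the size_t rule needs no cast at all, so any cast that does show up in review stands out:

    #include <cstdint>
    #include <cstddef>

    struct ByteBuffer {
        void append(const uint8_t*, size_t) { }
    };

    void appendResourceData(ByteBuffer& buffer, const uint8_t* data, size_t length)
    {
        buffer.append(data, length); // no conversion, no truncation, nothing to silence
    }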

So, I think I'm starting to agree with the size_t approach.

I wonder what the downsides are to this approach.  Footprint of Vector?
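A rough back-of-the-envelope for the footprint question (the layouts here are a guess at a typical inline Vector header, not WTF's actual layout):

    #include <cstddef>
    #include <cstdio>

    struct Header32 { void* buffer; unsigned size; unsigned capacity; };
    struct Header64 { void* buffer; size_t size; size_t capacity; };

    int main()
    {
        // On LP64 this is 16 vs. 24 bytes per Vector; it matters most when
        // many small Vectors are embedded inline in other objects.
        std::printf("%zu %zu\n", sizeof(Header32), sizeof(Header64));
        return 0;
    }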

-Filip


> On Nov 20, 2014, at 9:26 AM, Alexey Proskuryakov <ap at webkit.org> wrote:
> 
> 
> On Nov 19, 2014, at 2:58 PM, Alexey Proskuryakov <ap at webkit.org> wrote:
> 
>> These and related uses are all over the place - see also Vectors in FormDataBuilder, data returned from FrameLoader::loadResourceSynchronously, plug-in code that loads from the network, SharedBuffer, etc.
> 
> Another way to say this is: we need to deal with large data arrays throughout loading code. We do not really need full-size vectors in most other code - it's sort of OK for the HTML parser or an image decoder to fail catastrophically when too much data is fed to it.
> 
> This is somewhat questionable design, but if we are going to stick to it, then magnitude checks should be centralized, not sprinkled throughout the code. We should not have to make this check explicitly when feeding a network resource to the parser, for example.
> 
> A 64-bit API for Vector solves this nearly flawlessly. We do not perform the checks manually every time we use a Vector; Vector does them for us.
> 
> Other options are:
> 
> - uint64_t everywhere. This way, we'll solve practical problems with large resources once and for all. Also, this may prove to be necessary even for YouTube/Google Drive uploads; I do not know that yet.
> 
> - size_t everywhere. Same effect on 64-bit platforms, while 32-bit ones will still be limited. I like this option, because it won't make us pay the memory and performance cost on old crippled 32-bit machines, which are unlikely to be used for manipulating huge volumes of data anyway.
> 
> We'll also need to develop some magic for 53-bit JS bindings, which I'm not sure about.
> 
> - Alexey
> 
