[webkit-help] Regular expressions for content blocking
Romain Jacquinot
rjacquinot at me.com
Mon Aug 17 15:19:59 PDT 2015
Thank you very much Alex and Benjamin. Your answers were really helpful.
I wrongly thought the bracket syntax was only allowed for very basic ranges like [0-9] or [a-z], since the “Introduction to WebKit Content Blockers” only mentioned “Matching ranges with the range syntax [a-b]”.
I’m glad to know the full bracket syntax is actually supported.
Romain
> On Aug 17, 2015, at 9:59 PM, Benjamin Poulain <benjamin at webkit.org> wrote:
>
> Hi Romain,
>
> On 8/17/15 11:03 AM, Romain Jacquinot wrote:
>> For now, the following regular expression features are supported by
>> content blockers:
>>
>> * Matching any character with “.”.
>> * Matching ranges with the range syntax [a-b].
>> * Quantifying expressions with “?”, “+” and “*”.
>> * Groups with parenthesis.
>> * Beginning of line (“^”) and end of line (“$”) marker
>>
>> However, there doesn’t seem to be a way to find any of the alternatives
>> specified with “|” or find any character not between the brackets “[^]”.
>
> Actually the "[^]" character set syntax is supported.
>
> It could cause compile time issues on previous betas. That has been fixed in beta 5.
>
>> This is an issue when you want to block addresses like
>> http://www.example.com/, https://example.com/foobar.jpg,
>> http://example.com:8080 but not http://example.com.hk.
>
> The URLs are canonicalized before being processed by Content Blockers. That ensures some invariants on the format. For example, the domain name is always followed by ":" or "/". The domain name is always lowercase.
>
> Typically, I write domain triggers like this:
>
> "trigger": {
> "url-filter": "^https://([^:/]+\\.)example.com[:/]",
> "url-filter-is-case-sensitive": true
> }
>
>
>> With at least one of those features, you could write something like:
>>
>> {
>> "action" : {
>> "type" : "block"
>> },
>> "trigger" : {
>> "url-filter": "^https?://(www\\.)?example\\.com(/|:|?)+"
>
> This does not work but
> "^https?://(www\\.)?example\\.com[/:?]+"
> is equivalent.
>
>> }
>> }
>>
>> or:
>>
>> {
>> "action" : {
>> "type" : "block"
>> },
>> "trigger" : {
>> "url-filter" : "^https?://(www\\.)?example\\.com[^.]"
>
> This pattern should work fine in beta 5.
>
>> }
>> }
>>
>> Please note that in this case, the if-domain field wouldn’t help for
>> embedded content.
>>
>> Should I write the same rule many times for the different cases (“/",
>> “:", “?”)? (doesn’t feel like a very elegant solution though). Since
>> they share the same prefix, will these rules be optimized? On the webkit
>> blog, it is written "The rules are grouped by the prefix “https?://,
>> and it only counts as one rule with quantifiers.”. Does it mean that it
>> will only count as one rule against the 50,000 rule limit?
>
> Having 3 rules with 3 different endings is fine as long as they are not quantified. Their prefix would be merged in the compiler frontend.
>
> Having 3 rules with quantifiers per URL would likely cause your rules to be rejected by the compiler even under the 50k rule limit.
>
> In any case, the 50k rule limit is on the number of triggers. The number of rules is counted before rules are merged.
>
>> Do you see an elegant solution to handle this case? If not, could you
>> please consider adding at least one of those regular expression features
>> for content blockers in Safari?
>
> Are the solutions above good enough for your use case?
>
> Benjamin
On Aug 17, 2015, at 8:48 PM, Alex Christensen <achristensen at apple.com> wrote:
> On Aug 17, 2015, at 11:03 AM, Romain Jacquinot <rjacquinot at me.com> wrote:
>
> Hi,
>
> For now, the following regular expression features are supported by content blockers:
> Matching any character with “.”.
> Matching ranges with the range syntax [a-b].
> Quantifying expressions with “?”, “+” and “*”.
> Groups with parenthesis.
> Beginning of line (“^”) and end of line (“$”) marker
> However, there doesn’t seem to be a way to find any of the alternatives specified with “|” or find any character not between the brackets “[^]”.
| is indeed not implemented yet.
If I’m not mistaken, [^a] should work, though. You could always do tricky things with ranges, like [\u0001-.0-9;->@-\u007F] but this doesn’t read very well and it might lead to hard-to-find errors for those of us that don’t have ASCII memorized.
>
> This is an issue when you want to block addresses like http://www.example.com/, https://example.com/foobar.jpg, http://example.com:8080 but not http://example.com.hk.
>
> With at least one of those features, you could write something like:
>
> {
> "action" : {
> "type" : "block"
> },
> "trigger" : {
> "url-filter" : "^https?://(www\\.)?example\\.com(/|:|?)+"
> }
> }
>
> or:
>
> {
> "action" : {
> "type" : "block"
> },
> "trigger" : {
> "url-filter" : "^https?://(www\\.)?example\\.com[^.]"
> }
> }
>
> Please note that in this case, the if-domain field wouldn’t help for embedded content.
>
> Should I write the same rule many times for the different cases (“/”, “:”, “?”)? (doesn’t feel like a very elegant solution though). Since they share the same prefix, will these rules be optimized? On the webkit blog, it is written "The rules are grouped by the prefix “https?://, and it only counts as one rule with quantifiers.”. Does it mean that it will only count as one rule against the 50,000 rule limit?
Rules sharing a prefix are combined into the same DFA when compiling the combined regular expressions. Fewer DFAs means faster performance. A prefix in this case is all the terms of a regular expression up to the last quantified term, so ab?c and ab?d would be combined into the same DFA, and there wouldn’t be much of a performance penalty for adding more regular expressions with ab? at the beginning and no other quantified terms. But ab?cd?e has another quantified term, so it would be put into a separate DFA in our implementation. In your case, if all your rules start with ^https? with no other quantified terms, then they will all be optimized well. But if all the rules have unique terms before the last quantified term, like ^https?://a\.(com)?, ^https://b\.(com)?, ^https://c\.(com)?, etc., then these rules will not be combined well and it will hurt performance when checking if a URL matches the rules. To make it simple, the less you use ?, *, or +, the faster it will be.
You could write a rule many times, but the 50000 rule limit applies when parsing the rules, so each rule will count towards that limit.
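As a rough sketch of the repeated-rule approach (an illustration only, not taken from the blog post), the two rules below both start with ^https? and have no other quantified terms, so they differ only in an unquantified ending. Only the "/" and ":" endings are listed, on the assumption that the domain in a canonicalized URL is always followed by one of those two characters. The www. variant is left out to keep the sketch short.

```json
[
  {
    "action": { "type": "block" },
    "trigger": { "url-filter": "^https?://example\\.com/" }
  },
  {
    "action": { "type": "block" },
    "trigger": { "url-filter": "^https?://example\\.com:" }
  }
]
```

Each entry would still count separately toward the 50000 rule limit.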
>
> Do you see an elegant solution to handle this case? If not, could you please consider adding at least one of those regular expression features for content blockers in Safari?
You could do something like ^https?://(www\.)?example\.com[/:?]
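Written out as a complete rule, that could look like the following sketch. The "block" action is copied from the rules quoted above, and the backslashes are doubled because the pattern sits inside a JSON string.

```json
[
  {
    "action": { "type": "block" },
    "trigger": {
      "url-filter": "^https?://(www\\.)?example\\.com[/:?]"
    }
  }
]
```

This should match http://www.example.com/, https://example.com/foobar.jpg and http://example.com:8080 but not http://example.com.hk, since the character after "com" has to be "/", ":" or "?".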
>
> Thanks.
>