[webkit-help] Regular expressions for content blocking

Benjamin Poulain benjamin at webkit.org
Mon Aug 17 12:59:40 PDT 2015


Hi Romain,

On 8/17/15 11:03 AM, Romain Jacquinot wrote:
> For now, the following regular expression features are supported by
> content blockers:
>
>   * Matching any character with “.”.
>   * Matching ranges with the range syntax [a-b].
>   * Quantifying expressions with “?”, “+” and “*”.
>   * Groups with parenthesis.
>   * Beginning of line (“^”) and end of line (“$”) marker
>
> However, there doesn’t seem to be a way to find any of the alternatives
> specified with “|” or find any character not between the brackets "[^]”.

Actually the "[^]" character set syntax is supported.

It could cause compile time issues on previous betas. That has been 
fixed in beta 5.

> This is an issue when you want to block addresses like
> http://www.example.com/, https://example.com/foobar.jpg,
> http://example.com:8080 but not http://example.com.hk.

The URLs are canonicalized before being processed by Content Blockers.
That ensures some invariants on the format. For example, the domain
name is always followed by ":" or "/", and the domain name is always
lowercase.

Typically, I write domain triggers like this:

"trigger": {
     "url-filter": "^https://([^:/]+\\.)example\\.com[:/]",
     "url-filter-is-case-sensitive": true
}
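
As a rough sanity check of what that trigger accepts, here is a minimal sketch using Python's re module. Python's engine is not the content blocker engine, so this only approximates the behavior. The sample URLs are hypothetical, and the dot in "example.com" is escaped on the assumption that a literal dot is intended.

```python
import re

# Pattern from the trigger above, with the dot in "example.com" escaped
# (an assumption about the intent). Python's re only approximates the
# content blocker regex engine.
DOMAIN_FILTER = re.compile(r"^https://([^:/]+\.)example\.com[:/]")

# Hypothetical, already-canonicalized URLs: lowercase host, and the host
# is followed by ":" or "/".
SAMPLES = [
    "https://www.example.com/",
    "https://cdn.example.com:8080/a.jpg",
    "https://example.com/",         # no subdomain
    "https://www.example.com.hk/",  # different domain
    "http://www.example.com/",      # not https
]


def matches(url: str) -> bool:
    """Return True if the domain trigger pattern matches the URL."""
    return DOMAIN_FILTER.search(url) is not None


if __name__ == "__main__":
    for url in SAMPLES:
        print(f"{url:40} {matches(url)}")
```

As written, the pattern requires https and at least one subdomain. Under the same assumptions, appending "?" to the subdomain group would also accept the bare domain.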


> With at least one of those features, you could write something like:
>
>      {
> "action" : {
> "type" : "block"
>          },
> "trigger" : {
> "url-filter": "^https?://(www\\.)?example\\.com(/|:|?)+"

This does not work, but the character set form
     "^https?://(www\\.)?example\\.com[/:?]+"
is equivalent.
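
To illustrate the equivalence, the sketch below compares the two forms in Python's re module, which does support "|" (unlike content blockers). The "?" in the alternation is escaped here because an unescaped "?" in that position is not valid in Python. The sample URLs are hypothetical.

```python
import re

# Escaped alternation (roughly what the quoted rule was aiming for)
# versus the character set form. Python's re is used for illustration
# only; it is not the content blocker engine.
ALTERNATION = re.compile(r"^https?://(www\.)?example\.com(/|:|\?)+")
CHARACTER_SET = re.compile(r"^https?://(www\.)?example\.com[/:?]+")

SAMPLES = [
    "http://www.example.com/",
    "https://example.com/foobar.jpg",
    "http://example.com:8080",
    "http://example.com.hk/",
    "http://example.com?x=1",
]


def agree(url: str) -> bool:
    """Return True if both patterns give the same match/no-match answer."""
    a = ALTERNATION.search(url) is not None
    b = CHARACTER_SET.search(url) is not None
    return a == b


if __name__ == "__main__":
    for url in SAMPLES:
        hit = CHARACTER_SET.search(url) is not None
        print(f"{url:34} match={hit} same_as_alternation={agree(url)}")
```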

>          }
>      }
>
> or:
>
>      {
> "action" : {
> "type" : "block"
>          },
> "trigger" : {
> "url-filter" : "^https?://(www\\.)?example\\.com[^.]"

This pattern should work fine in beta 5.
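
One caveat with "[^.]", sketched below with Python's re module (an approximation, not the content blocker engine): it excludes example.com.hk, but it also accepts any host that merely starts with "example.com". Since a canonicalized domain is always followed by ":" or "/", ending the pattern with "[:/]" is a tighter alternative. The sample URLs, including example.community, are hypothetical.

```python
import re

# The negated set from the quoted rule, and a stricter variant that
# relies on the domain always being followed by ":" or "/".
NEGATED = re.compile(r"^https?://(www\.)?example\.com[^.]")
STRICT = re.compile(r"^https?://(www\.)?example\.com[:/]")

SAMPLES = [
    "http://example.com/",
    "http://example.com:8080/",
    "http://example.com.hk/",
    "http://example.community/",  # hypothetical look-alike host
]


def check(url: str) -> tuple:
    """Return (negated_matches, strict_matches) for the URL."""
    return (NEGATED.search(url) is not None, STRICT.search(url) is not None)


if __name__ == "__main__":
    for url in SAMPLES:
        negated, strict = check(url)
        print(f"{url:30} [^.]={negated} [:/]={strict}")
```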

>          }
>      }
>
> Please note that in this case, the if-domain field wouldn’t help for
> embedded content.
>
> Should I write the same rule many times for the different cases (“/",
> “:", “?”)? (doesn’t feel like a very elegant solution though). Since
> they share the same prefix, will these rules be optimized? On the webkit
> blog, it is written "/The rules are grouped by the prefix “https?://,
> and it only counts as one rule with quantifiers./”. Does it mean that it
> will only count as one rule against the 50,000 rule limit?

Having 3 rules with 3 different endings is fine as long as they are not
quantified. Their common prefix would be merged in the compiler frontend.

Having 3 rules with quantifiers per URL would likely cause your rules to 
be rejected by the compiler even under the 50k rule limit.

In any case, the 50k rule limit is on the number of triggers. The number
of rules is counted before rules are merged.
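
A minimal sketch of the "3 rules with 3 different endings" approach, generating the JSON with Python. The helper name build_rules and the three endings are illustrative assumptions taken from the question. Each generated rule has its own trigger, so each one counts toward the limit.

```python
import json

# Shared prefix and the three unquantified endings from the question.
# "?" is escaped so it is treated as a literal character.
PREFIX = r"^https?://(www\.)?example\.com"
ENDINGS = ["/", ":", r"\?"]


def build_rules(prefix: str, endings: list) -> list:
    """Build one block rule per ending; build_rules is a hypothetical helper."""
    return [
        {
            "action": {"type": "block"},
            "trigger": {"url-filter": prefix + ending},
        }
        for ending in endings
    ]


if __name__ == "__main__":
    # json.dumps doubles the backslashes, as the JSON rule format requires.
    print(json.dumps(build_rules(PREFIX, ENDINGS), indent=4))
```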

> Do you see an elegant solution to handle this case? If not, could you
> please consider adding at least one of those regular expression features
> for content blockers in Safari?

Are the solutions above good enough for your use case?

Benjamin

