Potential grep bug?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Potential grep bug?

Jordan Geoghegan-3
Hello,

I was working on a couple POSIX regular expressions to search for and
validate IPv4 and IPv6 addresses with optional CIDR blocks, and
encountered some strange behaviour from the base system grep.

I wanted to validate my regex against a list of every valid IPv4
address, so I generated a list with a zsh 1 liner:

      for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
tr '[:space:]' '\n' > IPv4.txt

My intentions were to test the regex by running it with 'grep -c' to
confirm there was indeed 2^32 addresses matched, and I also wanted to
benchmark and compare performance between BSD grep, GNU grep and
ripgrep. The command I used:

    grep -Eoc
"((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"

My findings were surprising. Both GNU grep and ripgrep were able get
through the file in roughly 10 and 20 minutes respectively, whereas the
base system grep took over 20 hours! What interested me the most was
that the base system grep when run with '-c' returned '0' for match
count. It seems that 'grep -c' will have its counter overflow if there
are more than 2^32-1 matches (4294967295) and then the counter will
start counting from zero again for further matches.

     ryzen$ time zcat IPv4.txt.gz | grep -Eoc "((25[0-5]|(2[0-4]|1{0,1}...
     0
     1222m09.32s real  1224m28.02s user     1m16.17s system

     ryzen$ time zcat allip.txt.gz | ggrep -Eoc "((25[0-5]|(2[0-4]|1{0,1}...
     4294967296
     10m00.38s real    11m40.57s user     0m30.55s system

     ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
     4294967296
     21m06.36s real    27m06.04s user     0m50.08s system

# See the counter overflow/reset:
     jot 4294967350 | grep -c "^[[:digit:]]"
     54

All testing was done on a Ryzen desktop machine running 6.7 stable.

The grep counting bug can be reproduced with this command:
    jot 4294967296 | nice grep -c "^[[:digit:]]"

Regards,

Jordan

Reply | Threaded
Open this post in threaded view
|

Re: Potential grep bug?

Martijn van Duren-9
This seems to fix the issue for me.

OK?

martijn@

On Tue, 2020-06-23 at 19:29 -0700, Jordan Geoghegan wrote:

> Hello,
>
> I was working on a couple POSIX regular expressions to search for and
> validate IPv4 and IPv6 addresses with optional CIDR blocks, and
> encountered some strange behaviour from the base system grep.
>
> I wanted to validate my regex against a list of every valid IPv4
> address, so I generated a list with a zsh 1 liner:
>
>       for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
> tr '[:space:]' '\n' > IPv4.txt
>
> My intentions were to test the regex by running it with 'grep -c' to
> confirm there was indeed 2^32 addresses matched, and I also wanted to
> benchmark and compare performance between BSD grep, GNU grep and
> ripgrep. The command I used:
>
>     grep -Eoc
> "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
>
> My findings were surprising. Both GNU grep and ripgrep were able get
> through the file in roughly 10 and 20 minutes respectively, whereas the
> base system grep took over 20 hours! What interested me the most was
> that the base system grep when run with '-c' returned '0' for match
> count. It seems that 'grep -c' will have its counter overflow if there
> are more than 2^32-1 matches (4294967295) and then the counter will
> start counting from zero again for further matches.
>
>      ryzen$ time zcat IPv4.txt.gz | grep -Eoc "((25[0-5]|(2[0-4]|1{0,1}...
>      0
>      1222m09.32s real  1224m28.02s user     1m16.17s system
>
>      ryzen$ time zcat allip.txt.gz | ggrep -Eoc "((25[0-5]|(2[0-4]|1{0,1}...
>      4294967296
>      10m00.38s real    11m40.57s user     0m30.55s system
>
>      ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
>      4294967296
>      21m06.36s real    27m06.04s user     0m50.08s system
>
> # See the counter overflow/reset:
>      jot 4294967350 | grep -c "^[[:digit:]]"
>      54
>
> All testing was done on a Ryzen desktop machine running 6.7 stable.
>
> The grep counting bug can be reproduced with this command:
>     jot 4294967296 | nice grep -c "^[[:digit:]]"
>
> Regards,
>
> Jordan
>
Index: util.c
===================================================================
RCS file: /cvs/src/usr.bin/grep/util.c,v
retrieving revision 1.62
diff -u -p -r1.62 util.c
--- util.c 3 Dec 2019 09:14:37 -0000 1.62
+++ util.c 24 Jun 2020 06:46:52 -0000
@@ -106,7 +106,8 @@ procfile(char *fn)
 {
  str_t ln;
  file_t *f;
- int c, t, z, nottext;
+ int t, z, nottext;
+ unsigned long long c;
 
  mcount = mlimit;
 
@@ -169,7 +170,7 @@ procfile(char *fn)
  if (cflag) {
  if (!hflag)
  printf("%s:", ln.file);
- printf("%u\n", c);
+ printf("%llu\n", c);
  }
  if (lflag && c != 0)
  printf("%s\n", fn);

Reply | Threaded
Open this post in threaded view
|

Re: Potential grep bug?

Philip Guenther-2
Nope.  This is a grep of a single file, so procfile() must be overflowing
and this only 'fixes' it by relying on signed overflow, which is undefined
behavior, being handled in a particular way by the compiler.  So, luck
(which fails when the compiler decides to hate you).  There are more places
that need to change for the reported problem to be handled safely.

Philip Guenther


On Tue, Jun 23, 2020 at 9:58 PM Martijn van Duren <
[hidden email]> wrote:

> This seems to fix the issue for me.
>
> OK?
>
> martijn@
>
> On Tue, 2020-06-23 at 19:29 -0700, Jordan Geoghegan wrote:
> > Hello,
> >
> > I was working on a couple POSIX regular expressions to search for and
> > validate IPv4 and IPv6 addresses with optional CIDR blocks, and
> > encountered some strange behaviour from the base system grep.
> >
> > I wanted to validate my regex against a list of every valid IPv4
> > address, so I generated a list with a zsh 1 liner:
> >
> >       for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done |
> > tr '[:space:]' '\n' > IPv4.txt
> >
> > My intentions were to test the regex by running it with 'grep -c' to
> > confirm there was indeed 2^32 addresses matched, and I also wanted to
> > benchmark and compare performance between BSD grep, GNU grep and
> > ripgrep. The command I used:
> >
> >     grep -Eoc
> >
> "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
> >
> > My findings were surprising. Both GNU grep and ripgrep were able get
> > through the file in roughly 10 and 20 minutes respectively, whereas the
> > base system grep took over 20 hours! What interested me the most was
> > that the base system grep when run with '-c' returned '0' for match
> > count. It seems that 'grep -c' will have its counter overflow if there
> > are more than 2^32-1 matches (4294967295) and then the counter will
> > start counting from zero again for further matches.
> >
> >      ryzen$ time zcat IPv4.txt.gz | grep -Eoc
> "((25[0-5]|(2[0-4]|1{0,1}...
> >      0
> >      1222m09.32s real  1224m28.02s user     1m16.17s system
> >
> >      ryzen$ time zcat allip.txt.gz | ggrep -Eoc
> "((25[0-5]|(2[0-4]|1{0,1}...
> >      4294967296
> >      10m00.38s real    11m40.57s user     0m30.55s system
> >
> >      ryzen$ time rg -zoc "((25[0-5]|(2[0-4]|1{0,1}...
> >      4294967296
> >      21m06.36s real    27m06.04s user     0m50.08s system
> >
> > # See the counter overflow/reset:
> >      jot 4294967350 | grep -c "^[[:digit:]]"
> >      54
> >
> > All testing was done on a Ryzen desktop machine running 6.7 stable.
> >
> > The grep counting bug can be reproduced with this command:
> >     jot 4294967296 | nice grep -c "^[[:digit:]]"
> >
> > Regards,
> >
> > Jordan
> >
> Index: util.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/grep/util.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 util.c
> --- util.c      3 Dec 2019 09:14:37 -0000       1.62
> +++ util.c      24 Jun 2020 06:46:52 -0000
> @@ -106,7 +106,8 @@ procfile(char *fn)
>  {
>         str_t ln;
>         file_t *f;
> -       int c, t, z, nottext;
> +       int t, z, nottext;
> +       unsigned long long c;
>
>         mcount = mlimit;
>
> @@ -169,7 +170,7 @@ procfile(char *fn)
>         if (cflag) {
>                 if (!hflag)
>                         printf("%s:", ln.file);
> -               printf("%u\n", c);
> +               printf("%llu\n", c);
>         }
>         if (lflag && c != 0)
>                 printf("%s\n", fn);
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Potential grep bug?

Demi M. Obenour
In reply to this post by Jordan Geoghegan-3
On 2020-06-23 22:29, Jordan Geoghegan wrote:

> Hello,
>
> I was working on a couple POSIX regular expressions to search for and validate IPv4 and IPv6 addresses with optional CIDR blocks, and encountered some strange behaviour from the base system grep.
>
> I wanted to validate my regex against a list of every valid IPv4 address, so I generated a list with a zsh 1 liner:
>
>      for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done | tr '[:space:]' '\n' > IPv4.txt
>
> My intentions were to test the regex by running it with 'grep -c' to confirm there was indeed 2^32 addresses matched, and I also wanted to benchmark and compare performance between BSD grep, GNU grep and ripgrep. The command I used:
>
>    grep -Eoc "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
>
> My findings were surprising. Both GNU grep and ripgrep were able get through the file in roughly 10 and 20 minutes respectively, whereas the base system grep took over 20 hours! What interested me the most was that the base system grep when run with '-c' returned '0' for match count. It seems that 'grep -c' will have its counter overflow if there are more than 2^32-1 matches (4294967295) and then the counter will start counting from zero again for further matches.
Does OpenBSD use an NFA/DFA regular expression implementation, or a
backtracking one?  If it uses the latter, then your regex is probably
causing catastrophic backtracking.

> Regards,
>
> Jordan

Sincerely,

Demi


signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Potential grep bug?

Jordan Geoghegan-3
Hi Demi,

On 2020-07-10 22:42, Demi M. Obenour wrote:

> On 2020-06-23 22:29, Jordan Geoghegan wrote:
>> Hello,
>>
>> I was working on a couple POSIX regular expressions to search for and validate IPv4 and IPv6 addresses with optional CIDR blocks, and encountered some strange behaviour from the base system grep.
>>
>> I wanted to validate my regex against a list of every valid IPv4 address, so I generated a list with a zsh 1 liner:
>>
>>       for i in {0..255}; do; echo $i.{0..255}.{0..255}.{0..255} ; done | tr '[:space:]' '\n' > IPv4.txt
>>
>> My intentions were to test the regex by running it with 'grep -c' to confirm there was indeed 2^32 addresses matched, and I also wanted to benchmark and compare performance between BSD grep, GNU grep and ripgrep. The command I used:
>>
>>     grep -Eoc "((25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[[:digit:]]){0,1}[[:digit:]])(/[1-9]|/[1-2][[:digit:]]|/3[0-2])?"
>>
>> My findings were surprising. Both GNU grep and ripgrep were able get through the file in roughly 10 and 20 minutes respectively, whereas the base system grep took over 20 hours! What interested me the most was that the base system grep when run with '-c' returned '0' for match count. It seems that 'grep -c' will have its counter overflow if there are more than 2^32-1 matches (4294967295) and then the counter will start counting from zero again for further matches.
> Does OpenBSD use an NFA/DFA regular expression implementation, or a
> backtracking one?  If it uses the latter, then your regex is probably
> causing catastrophic backtracking.
>
>> Regards,
>>
>> Jordan
> Sincerely,
>
> Demi
>

Ya you're probably right, I know GNU grep does a whole bunch of fancy
optimizations that aren't done in BSD grep, the old GNU grep maintainer
did a small write up on the FreeBSD mailing list about some of their tricks:
https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

My IPv6 regex causes grep to randomly segfault on MacOS (which uses an
older version of the BSD/FreeGrep that OpenBSD uses I believe), so I
imagine there's something going on under the hood with these regexes
making grep thrash hard.


Regards,

Jordan