July 3rd, 2008
If you think of "grep | sort | uniq -c" as a "map" operation, then I wrote a "reducer" for it. I count things with grep | sort | uniq -c on really large files often enough that I finally broke down and wrote a simple Perl script that, in some cases, can dramatically speed up a query.
Compare this:
$ time zcat /tmp/29M_of_logfiles | cut -d ' ' -f 1 | sort | uniq -c
[results here]
real 1m36.721s
user 1m38.942s
sys 0m1.600s
To this:
$ time zcat /tmp/29M_of_logfiles | cut -d ' ' -f 1 | uniq -c | perl uniq_sum.pl
[exact same results here]
real 0m17.146s
user 0m15.437s
sys 0m0.980s
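Why the second pipeline can give the same answer: without a preceding sort, uniq -c only collapses adjacent runs, so it emits partial counts, and the same key may appear more than once. Summing those partial counts per key gives the same totals as sorting first. Here is a minimal sketch of that idea on made-up sample data, with a short awk one-liner standing in for the reducer (the awk stand-in and the sample input are assumptions for illustration, not part of the original script):

```shell
# Hypothetical sample input: one key per line, unsorted, with runs of repeats.
sample=$(printf '%s\n' a a b a a a b b c)

# Baseline: sort every line first, then count.
sorted_counts=$(printf '%s\n' "$sample" | sort | uniq -c | awk '{print $1, $2}')

# Alternative: count adjacent runs only, then sum the partial counts per key.
# awk stands in for the reducer; the final sort only sees one line per key.
reduced_counts=$(printf '%s\n' "$sample" | uniq -c \
  | awk '{sum[$2] += $1} END {for (k in sum) print sum[k], k}' | sort -k2)

echo "$sorted_counts"
echo "$reduced_counts"
```

Both variables should hold the same three lines. The saving comes from how few lines reach the summing step: the more adjacent repeats the input has, the less work is left after uniq -c.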
Here's the source code. I wrote it at PBwiki, and I'm putting it online with permission from David Weekly.
#!/usr/bin/perl
# uniq_sum.pl: a "reducer" for 'uniq -c'
#
# For example, you could do this:
# $ grep -h THING * | sort | uniq -c
#
# but this would be faster (because you don't need to sort lots of lines):
# $ (for file in *; do grep THING "$file" | uniq -c; done) | perl uniq_sum.pl
#
# Copyright 2008 PBwiki, Inc
#
# Author: Joel Franusic
use strict;
use warnings;

my $field_length = 0;   # widest count column seen in the input
my %hash;               # key => summed count
while (<>) {
    chomp;
    # A 'uniq -c' line: optional padding, a count, one space, then the key.
    next unless /^(\s*\d+)\s(.+)$/;
    my ($count, $key) = ($1, $2);
    # Use the full column width (not the last index) so the output lines up
    # with what 'uniq -c' prints.
    $field_length = length($count) if length($count) > $field_length;
    $hash{$key} += $count;
}
foreach my $key (sort keys %hash) {
    printf("%*d %s\n", $field_length, $hash{$key}, $key);
}
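The per-file pattern from the script's header comment can be tried on a small scale. The sketch below builds two throwaway log files, counts matches per file with uniq -c, and merges the partial counts. The file names, log lines, and the awk stand-in for uniq_sum.pl are all assumptions made for illustration:

```shell
# Hypothetical setup: two small log files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'error disk' 'error disk' 'info ok' 'error net' > "$dir/one.log"
printf '%s\n' 'error disk' 'error net' 'error net' > "$dir/two.log"

# Per-file partial counts: each file is grepped on its own, so no big sort is needed.
partial=$(for file in "$dir"/*.log; do grep error "$file" | uniq -c; done)

# Merge the partial counts. awk stands in for uniq_sum.pl in this sketch:
# save the count, strip the count column, and sum by the rest of the line.
totals=$(printf '%s\n' "$partial" \
  | awk '{n = $1; sub(/^ *[0-9]+ /, ""); sum[$0] += n}
         END {for (k in sum) printf "%7d %s\n", sum[k], k}' \
  | sort -k2)

echo "$totals"
rm -r "$dir"
```

With the real script saved as uniq_sum.pl, the awk and sort stages would be replaced by `perl uniq_sum.pl`, which sorts by key itself.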
