July 3rd, 2008

A reducer for uniq -c

  • Jul. 3rd, 2008 at 4:27 PM
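If you think of "grep | sort | uniq -c" as a "map" operation, then I wrote a "reducer" for it. I count things with grep | sort | uniq -c on really large files often enough that I finally broke down and wrote a simple Perl script that, in some cases, dramatically speeds up a query. The trick is to skip the sort and let the script add up the partial counts that uniq -c produces, which pays off when the input has long runs of identical lines. For example: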


Compare this:


$ time zcat /tmp/29M_of_logfiles | cut -d ' ' -f 1 | sort | uniq -c
[results here]

real 1m36.721s
user 1m38.942s
sys 0m1.600s


To this:


$ time zcat /tmp/29M_of_logfiles | cut -d ' ' -f 1 | uniq -c | perl uniq_sum.pl
[exact same results here]

real 0m17.146s
user 0m15.437s
sys 0m0.980s
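
The second pipeline has no sort, so uniq -c only merges adjacent identical lines. Here is a minimal sketch of what that does to the counts, assuming a made-up six-line input (the sample function is hypothetical and stands in for the log data):

```shell
# Hypothetical sample input: "a" appears in two separate runs.
sample() { printf '%s\n' a a b a a a; }

# Without sort, uniq -c counts each run separately, so "a" is listed twice.
partial=$(sample | uniq -c)

# With sort, every "a" is adjacent and uniq -c reports one total per key.
total=$(sample | sort | uniq -c)

printf 'uniq -c alone:\n%s\n\nsort | uniq -c:\n%s\n' "$partial" "$total"
```

The partial counts for each key (2 and 3 for "a") add up to the sorted total (5), and that addition is the job left for the reducer.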


Here's the source code. I wrote it at PBwiki, and I'm putting it online with permission from David Weekly.


#!/usr/bin/perl
# uniq_sum.pl: a "reducer" for 'uniq -c'
#
# For example, you could do this:
# $ grep -h THING * | sort | uniq -c
#
# but this would be faster (because you don't need to sort lots of lines):
# $ (for file in *; do grep THING "$file" | uniq -c; done) | perl uniq_sum.pl
#
# Copyright 2008 PBwiki, Inc
#
# Author: Joel Franusic

use strict;
use warnings;

my $field_length = 0;   # width of the widest count field seen in the input
my %hash;               # key => running total of its partial counts

while (<>) {
    chomp;
    # Each 'uniq -c' line is a padded count, one space, then the key.
    next unless /^(\s*\d+)\s(\S.*)$/;
    my ($count, $key) = ($1, $2);
    $field_length = length($count) if length($count) > $field_length;
    $hash{$key} += $count;
}

# Print the totals in the same layout as 'uniq -c', sorted by key.
foreach my $key (sort keys %hash) {
    printf("%*d %s\n", $field_length, $hash{$key}, $key);
}
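
To check the idea on a small input without the script at hand, here is a rough sketch that uses an awk stand-in for the reducer. Both uniq_sum_awk and sample are names made up for this illustration, and the awk version is not the Perl script above; it only assumes the same "count, space, key" line layout.

```shell
# Stand-in reducer written in awk (an assumption for this sketch, not the
# Perl script): sum the partial counts from "uniq -c" output per key.
uniq_sum_awk() {
  awk '{
         count = $1                   # first field is the partial count
         sub(/^[ \t]*[0-9]+ /, "")    # drop the count, keep the key as-is
         total[$0] += count
       }
       END { for (key in total) printf "%7d %s\n", total[key], key }' |
    LC_ALL=C sort -k2                 # order by key, like "sort keys" in Perl
}

# Hypothetical sample input with interleaved runs of "a" and "b".
sample() { printf '%s\n' b b a b a a a; }

reduced=$(sample | uniq -c | uniq_sum_awk)
sorted=$(sample | LC_ALL=C sort | uniq -c)

printf 'uniq -c | reducer:\n%s\n\nsort | uniq -c:\n%s\n' "$reduced" "$sorted"
```

Both pipelines should report the same count for each key. The padding in front of the counts can differ between uniq implementations, so compare the numbers rather than the exact spacing.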
