Tuesday 18 May 2010

[Perl] Removing spaces from strings

Thanks to Fred Moyer in the PerlMongers at LinkedIn I learned about the existence of the module:
String::Strip

It uses XS to remove spaces from a string and claims to be 35% faster.

You can also trim your text with Text::Trim.

I prefer to use modules for typical patterns, but here a simple s/^\s+|\s+$//g will do the trick, and I have it in my Emacs macros as a "sub trim {...}" ;-). This is probably a few nanoseconds slower than the other approaches, but trimming is not my bottleneck, database access is :-(.
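For reference, a minimal sketch of what such a "sub trim" could look like. The exact body of my macro is not shown here, so take the details (working on a copy, returning the result) as assumptions:

```perl
use strict;
use warnings;

# Hypothetical helper: trims a copy of its argument and returns it.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # leading OR trailing whitespace, hence /g
    return $text;
}

print '[', trim("  This is great \n"), "]\n";    # prints [This is great]
```

The /g matters here: without it only the first alternative that matches (the leading whitespace) would be removed.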
[update 2010-06-13]
Tom Christiansen in "Perl Cookbook, 2nd Edition"

--Recipe 1.19. Trimming Blanks from the Ends of a String--

shows another approach:
If the function isn't passed any arguments at all, it could act like chop and chomp by defaulting to $_. Incorporating all of these embellishments produces this function:
# 1. trim leading and trailing white space
# 2. collapse internal whitespace to single space each
# 3. take input from $_ if no arguments given
# 4. join return list into single scalar with intervening spaces 
#     if return is scalar context

sub trim {
    my @out = @_ ? @_ : $_;
    $_ = join(' ', split(' ')) for @out;
    return wantarray ? @out : "@out";
}

PS:
I will need to update my trim sub at some point (following PBP) to something like
s/\A\s+|\s+\z//gms
But I would need to check that it does the right thing and removes spaces before and after "\n".
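A quick way to check this is to run both anchor styles on a two-line string. This is only a sketch: the helper names trim_ends and trim_lines are made up for the comparison, and the sample text is mine. The first helper uses the PBP-style pattern s/\A\s+|\s+\z//gms, the second uses ^ and $ with the same flags. On this input the \A ... \z version only touches the two ends of the whole string and leaves the spaces around the inner "\n" alone:

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for the comparison.
sub trim_ends {             # \A and \z: anchors of the whole string
    my ($text) = @_;
    $text =~ s/\A\s+|\s+\z//gms;
    return $text;
}

sub trim_lines {            # ^ and $ under /m: anchors of every line
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//gms;
    return $text;
}

my $text = "  first line  \n  second line  \n";

# Show the newlines explicitly so the difference is visible.
for my $result (trim_ends($text), trim_lines($text)) {
    (my $shown = $result) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
# prints:
# [first line  \n  second line]
# [first line\nsecond line]
```

Note that \s also matches "\n", so the ^ ... $ version can swallow newlines as well (for example around blank lines). If the lines themselves must be preserved, trimming line by line is probably safer.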

* Posted by David Bouman in perlmongers at linkedIn:
sub trim { return unpack 'A*', reverse unpack 'A*', reverse shift }
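A small sketch to see what the unpack trick does. The name trim_unpack is mine, not David's. According to perlfunc, 'A' strips trailing whitespace and nulls when unpacking, so unlike the regex version this one also removes NUL characters at both ends:

```perl
use strict;
use warnings;

# Same body as David's one-liner, renamed only to keep it apart
# from the regex-based trim.
sub trim_unpack {
    return unpack 'A*', reverse unpack 'A*', reverse shift;
}

print '[', trim_unpack('  This is great  '), "]\n";    # prints [This is great]
print '[', trim_unpack("\0 padded \0"), "]\n";          # prints [padded]
```

The string is reversed, stripped at its (new) end, reversed back and stripped again, which is why two unpack calls are needed.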

* Posted by Tobie Van Der Merwe in perlmongers at linkedIn:

Tobie solves the problem by doing more work than needed: instead of eliminating the white space directly, he captures the text in between. This is probably not elegant and a bit complicated (see Gábor's reply), but on the other hand he has shown one good Perl attitude and one not so good.

The good one: always give a test case and the code to prove that your code works. He probably did not know a better regex, but he had a good attitude.

The not so good one: if you write a complex regex with some tricky parts (non-greedy quantifiers), you should use /x and add comments.
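As an illustration of the /x advice, this is how Tobie's regex might look when spelled out with comments. It is only a sketch: the wrapper name trim_capture is hypothetical, and the unnecessary square brackets around \s are dropped:

```perl
use strict;
use warnings;

# Hypothetical wrapper around Tobie's regex, spelled out with /x.
sub trim_capture {
    my ($text) = @_;
    $text =~ s{
        ^ \s*     # leading whitespace, if any
        (.*?)     # the text to keep, matched non-greedily so that
                  # it does not swallow the trailing whitespace
        \s* $     # trailing whitespace up to the end of the string
    }{$1}x;
    return $text;
}

print '[', trim_capture('  This is great  '), "]\n";    # prints [This is great]
```

Because "." does not match "\n" without /s, this version is meant for single-line strings like the ones in Tobie's test cases.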

[Tobie] I think it can be solved with a regex -

# Short program to test left and right
# trim regex - s/^[\s]*(.*?)[\s]*$/$1/

my %tests = ( 1 => 'This is great',
2 => 'This is great ',
3 => ' This is great',
4 => ' This is great ',
5 => ' 12345AA 22277 ',
6 => ' !^%&^"$ ^777% ');

foreach my $num (sort keys (%tests)) {
print "BEFORE[" . $tests{$num} . "]\n";
$tests{$num} =~ s/^[\s]*(.*?)[\s]*$/$1/;
print "AFTER [" . $tests{$num} . "]\n";
}


* Gábor Szabó reply:

Looking at s/^[\s]*(.*?)[\s]*$/$1/
besides the fact that the square brackets [] around the \s are not necessary and only add noise, it is one of the examples I use to show that you do NOT have to do everything with one regex.

This solution is both complex and error prone - sanjeev indeed missed out on the ? that turns the otherwise greedy quantifier into minimal matching. You can of course use it, as TMTOWTDI, but then don't be surprised if people think Perl is cryptic.

Lastly, really, is trimming whitespace such an important task that it justifies 60 posts?

2 comments:

Anonymous said...

Even if it's not your bottleneck, if you're inserting the code by macro you might as well use the faster solution:

s/^\s+//g;
s/\s+$//g;

It's over 3 times faster even in the trivial case, and scales far better.

If you're inserting it by macro then it's no extra effort to use. ;)

I've covered this with hard figures in exhaustive detail in Advanced Benchmark Analysis I: Yet more white-space trimming on my perl blog.

As you say, not your bottleneck, but it's a trivial difference in effort and a good habit to form.
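[editor's sketch] For anyone who wants to measure this on their own machine, a minimal comparison with the core Benchmark module could look like the following. The sub names and the sample string are assumptions, and the numbers will depend on your Perl version and input, so no figures are claimed here. The /g is dropped in the two-statement version because each anchored pattern can match at most once:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Single regex with alternation, as in the post.
sub trim_one_regex {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Two separate substitutions, as suggested in the comment.
sub trim_two_regexes {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return $text;
}

# Made-up sample: padded on both sides, with inner spaces.
my $sample = '   ' . ('some words ' x 20) . '   ';

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    one_regex   => sub { trim_one_regex($sample) },
    two_regexes => sub { trim_two_regexes($sample) },
});
```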

Pablo Marin-Garcia said...

@Illusori,
thanks. I remembered having read your blog entry, and the flame war about optimization/profiling that your entry and a similar one on another blog started, but I was unable to remember where I had read it. That was where I learnt about the speed difference, and this is why I mentioned that the two-line option was faster. I had no time to find the source, so I left it that way. Thanks for refreshing my memory.

Your point about the macro is right. I cannot help it, I like "perlgolfing" ;-).