It's funny...when I read that bit, all I could think was, how much nicer would it be if regexes could just be written without the special characters, using sane word-based self-documenting tokens the way we do the rest of our programming, following scoping and quotation rules that actually mesh with the languages we're using them in?
Then it occurred to me that plenty of people have probably written libraries to do exactly that, but nobody uses them because we all already have regexes built in almost everywhere that we want to use them. Hell, I've never even looked for one, even though I choke back a little vomit every time I introduce a new regex into a codebase because of how much future debugging pain I know they can cause. All but the shortest ones force what's essentially a full context shift in order to parse, and in reality what usually happens is people scan a regex as one chunk and say "Eh, a regex, it's probably right, hopefully my bug is somewhere else..." until they have some concrete reason to think otherwise.
Sort of a shame, really, that such a problematically condensed syntax won the prize so early on, and now even those of us that hate it are too comfortable with it to look for something better.
Regex aren't difficult to learn, it's just nobody teaches them as a language with a base syntax and words to use. If you sit down, memorize the names of a few symbols, and learn what each one does, it becomes fairly clear.
It's my belief (totally unfounded) that learning a simple symbolic language like regular expressions teaches you how to handle other symbolic languages like mathematics, chemistry, and programming. That's one of the reasons I'm teaching it and trying to get other people to use it.
More importantly though, they are damn handy. As long as you don't abuse them in places where a lexer+parser is better, you can get a lot done with very little regex in a very short time.
Unfortunately, the actual syntax of regexps is far from ideal. As an example, it’s completely stupid that non-capturing groups must be written as (?:…) when they are by far the common case.
I don't consider that a failing of regular expressions (which predate Perl), but a failure of implementation. I've thought that, instead of syntax, it should be an API option that says "this is a matcher" vs. "this is a capture". Then the same regex works for both; the only difference is how you run it.
So would it not be possible to mix capturing and non-capturing groups in the same regex? That's useful to do if you have code that expects specific things to be captured at specific group indices, and you need to add grouping somewhere else in the regex without messing it up.
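To make the group-index concern concrete, here's a minimal sketch in JavaScript. The date pattern, the "date: " prefix and the yearOf helper are all made up for illustration. It shows why you want both kinds of group in one regex: adding a capturing group renumbers everything after it, while a non-capturing group doesn't.

```javascript
// Hypothetical pattern: group 1 = year, group 2 = month.
const original = /(\d{4})-(\d{2})/;

// Adding an optional prefix as a capturing group shifts every index by one.
const shifted = /(date: )?(\d{4})-(\d{2})/;

// The same prefix as a non-capturing group leaves the indices alone.
const stable = /(?:date: )?(\d{4})-(\d{2})/;

const input = "date: 2011-05";

// Stand-in for existing code that expects the year at group index 1.
const yearOf = (re) => re.exec(input)[1];

console.log(yearOf(original)); // "2011"
console.log(yearOf(shifted));  // "date: " (the old code now reads the prefix)
console.log(yearOf(stable));   // "2011"
```

With a single API-level switch for "capture everything" vs. "capture nothing", the third form couldn't be expressed, as far as I can tell.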
::Regex aren't difficult to learn, it's just nobody teaches them as a language with a base syntax and words to use::
I couldn't agree more. Because of the lack of a "language" approach to them, their weird syntax, and the fact that the "verbose" mode for defining them is almost unknown, they come across as a sort of voodoo that only gurus can handle. Moreover, this results in tons of broken code in production. They're simple, beautiful and handy, but they carry an unfortunate historical load.
What I miss the most about regexes (and I think this is kind of what bermanoid was hinting at) is that we don't have access to much of the expressiveness we usually have in a programming language. For example, I have never seen a widely used regex library that takes advantage of the algebraic structure of regular expressions and lets me do things like incrementally build regexes or create named constants:
var regex1 = /some_regex/,
regex2 = /other_regex/;
var regex3 = alternative(regex1, regex2);
var regex4 = kleene_star( sequence(regex1, regex2) );
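For what it's worth, the combinators in that snippet can be roughed out in a few lines on top of the standard RegExp source property. This is only a sketch under my own assumptions, not any existing library's API: the group and whole helpers are hypothetical names, flags on the operands are dropped, and any capturing groups inside the operands keep their left-to-right numbering in the combined regex.

```javascript
// Hypothetical helpers, not taken from any existing library.
// Each operand is wrapped in a non-capturing group so that precedence
// is preserved when the sources are spliced together.
const group = (re) => `(?:${re.source})`;

const sequence = (...parts) => new RegExp(parts.map(group).join(""));
const alternative = (...parts) => new RegExp(parts.map(group).join("|"));
const kleene_star = (re) => new RegExp(`${group(re)}*`);

// Anchor a combined expression so it has to match the whole input.
const whole = (re) => new RegExp(`^${group(re)}$`);

const regex1 = /ab/,
      regex2 = /cd/;

const regex3 = alternative(regex1, regex2);
const regex4 = kleene_star(sequence(regex1, regex2));

console.log(regex3.source);                  // (?:ab)|(?:cd)
console.log(whole(regex3).test("cd"));       // true
console.log(whole(regex4).test("abcdabcd")); // true
console.log(whole(regex4).test("abc"));      // false
```

The extra (?:…) wrapping makes the generated source ugly, but nobody has to read that part; the named constants and function calls are what end up in the codebase.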