
In my career I've found several reasons not to use regular expressions for parsing an HTML response, but the biggest is that a regex may work for well-formed documents, while you would be surprised how lax browsers are about requiring documents to be well-formed. Unless you handle them specifically, your regex will fail on sites like that (and in my experience there are a lot of them). You may be able to work edge cases into your regex, but good luck finding anyone besides the expression's author who fully understands it and can confidently change it as time goes on. It is also a PITA to debug when groupings etc. aren't matching (and with HTML/XML documents there will be a LOT of those cases).

It is honestly almost never worth it unless you're constrained in which packages you can use and MUST use regular expressions. Do your future self a favor and use BeautifulSoup or another package designed to parse the tree structure of these documents.
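To make the "use a real parser" point concrete without assuming any third-party package is installed, here is a minimal sketch using only Python's stdlib html.parser (BeautifulSoup offers a richer API on top of the same idea). The messy markup string is a made-up illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Unclosed <a> tags and an unclosed <p>: a lenient parser shrugs this off,
# the same way browsers do.
messy = '<p>Hi <a href="/one">one <a href="/two">two'
collector = LinkCollector()
collector.feed(messy)
print(collector.links)  # ['/one', '/two']
```

The parser recovers both links even though nothing in the document is properly closed, which is exactly the malformed-but-renders-fine input that trips up hand-rolled regexes.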

One way regex can be used appropriately is for finding a pattern somewhere in the document, without caring where it sits relative to the rest of the document. But even then, do you really want to match <!-- <div> --> ?



For all the things jQuery got wrong, it got one thing right: arguably the most intuitive way to target a set of data in a document is by having a concise DSL that works on a parsed representation of the document.

I'd love to see more innovation/developer-UX research on the interactions between regexes, document parse trees, and NLP. For instance, "match every verb phrase where the verb has similar meaning to 'call' within the context of a specific CSS selector, and be able to capture any data along that path in capturing groups, and do something with it" right now takes significant amounts of coding.

https://spacy.io/usage/rule-based-matching does a lot, but (a) it's not particularly concise, (b) there's no standardized syntax for e.g. replacement strings once you detect something, and (c) there are no real facilities to bake in a knowledge of hierarchy within a larger markup-language document.


I think scraping is just inherently brittle whether you traverse the DOM or use regex. AI may have the best potential. Regex can be slightly more brittle, as you point out with commented-out HTML and myriad other problems, but it can also be less brittle than DOM traversal if you craft more lenient patterns. The main problem I found was regexes not being performant due to recursive backtracking and stack overflows (Google's RE2 lib addresses this). My favorite performance trick is to use a negated character class rather than a dot: /<foo[^>]*>/
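A quick sketch of why the negated-class trick matters, using a made-up snippet: a greedy dot can run past the tag boundary to the last '>' in the string, while [^>]* cannot cross a '>' at all, so it stays inside the tag (and never backtracks over the same ground):

```python
import re

html = '<foo class="a"><bar></bar><foo class="b">'

# Greedy dot runs to the LAST '>', swallowing everything in between.
print(re.search(r'<foo.*>', html).group())
# '<foo class="a"><bar></bar><foo class="b">'

# The negated class stops at the first '>', i.e. the actual tag boundary.
print(re.findall(r'<foo[^>]*>', html))
# ['<foo class="a">', '<foo class="b">']
```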


> confidently change it

Having a good variety of tests helps.

> tree structure

You'll need something more powerful than a regular language to parse a tree — classic regexes can't track arbitrary nesting depth.
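The nesting limitation shows up immediately with a lazy quantifier on a made-up nested snippet: the match stops at the first closing tag, which belongs to the inner element, not the one the match opened:

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"

# The lazy match stops at the FIRST </div>, splitting the outer element.
print(re.search(r"<div>(.*?)</div>", nested).group(1))
# 'outer <div>inner'
```

A greedy quantifier fails in the mirror-image way (overrunning to the last closing tag), so no amount of quantifier tuning fixes it; tracking depth requires a parser.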



