
In my career I've found several reasons not to use regular expressions for parsing an HTML response, but the biggest is that a regex may work for well-formed documents, while you would be surprised how lax browsers are about requiring documents to be well-formed. Unless you handle them specifically, your regex will fail on sites like that (and in my experience there are a lot of them). You may be able to work edge cases into your regex, but good luck finding anyone besides the expression's author who fully understands it and can confidently change it as time goes on. It is also a PITA to debug when groupings etc. aren't matching (and with HTML/XML documents there will be a LOT of those cases).

It is honestly almost never worth it unless you're constrained in which packages you can use and MUST use regular expressions. Do your future self a favor and use BeautifulSoup or another package designed to parse the tree structure of these documents.
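To make the "use a real parser" point concrete without assuming any third-party package is installed, here is a minimal sketch using only Python's stdlib html.parser (BeautifulSoup offers a richer API on top of the same idea). The messy markup string is a made-up illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Unclosed <a> tags and an unclosed <p>: a lenient parser shrugs this off,
# the same way browsers do.
messy = '<p>Hi <a href="/one">one <a href="/two">two'
collector = LinkCollector()
collector.feed(messy)
print(collector.links)  # ['/one', '/two']
```

The parser recovers both links even though nothing in the document is properly closed, which is exactly the malformed-but-renders-fine input that trips up hand-rolled regexes.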

One way regex can be used appropriately is for finding a pattern somewhere in the document, without caring where it sits relative to the rest of the document. But even then, do you really want to match <!-- <div> --> ?



For all the things jQuery got wrong, it got one thing right: arguably the most intuitive way to target a set of data in a document is by having a concise DSL that works on a parsed representation of the document.

I'd love to see more innovation/developer-UX research on the interactions between regexes, document parse trees, and NLP. For instance, "match every verb phrase where the verb has similar meaning to 'call' within the context of a specific CSS selector, and be able to capture any data along that path in capturing groups, and do something with it" right now takes significant amounts of coding.

https://spacy.io/usage/rule-based-matching does a lot, but (a) it's not particularly concise, (b) there's no standardized syntax for e.g. replacement strings once you detect something, and (c) there are no real facilities to bake in a knowledge of hierarchy within a larger markup-language document.


I think scraping is just inherently brittle whether you traverse the DOM or use regex. AI may have the best potential. Regex can be slightly more brittle, as you point out with commented-out HTML and myriad other problems, but it can also be less brittle than DOM traversal if you craft more lenient patterns. The main problem I found was regexes not being performant due to recursive backtracking and stack overflows (Google's RE2 lib addresses this). My favorite performance trick is to use a negated character class rather than a dot: /<foo[^>]*>/
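A quick sketch of why the negated-class trick matters, using a made-up snippet: a greedy dot can run past the tag boundary to the last '>' in the string, while [^>]* cannot cross a '>' at all, so it stays inside the tag (and never backtracks over the same ground):

```python
import re

html = '<foo class="a"><bar></bar><foo class="b">'

# Greedy dot runs to the LAST '>', swallowing everything in between.
print(re.search(r'<foo.*>', html).group())
# '<foo class="a"><bar></bar><foo class="b">'

# The negated class stops at the first '>', i.e. the actual tag boundary.
print(re.findall(r'<foo[^>]*>', html))
# ['<foo class="a">', '<foo class="b">']
```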


> confidently change it

Having a good variety of tests helps.

> tree structure

You'll need something more powerful than a regular language to parse a tree — classic regexes can't track arbitrary nesting depth.
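The nesting limitation shows up immediately with a lazy quantifier on a made-up nested snippet: the match stops at the first closing tag, which belongs to the inner element, not the one the match opened:

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"

# The lazy match stops at the FIRST </div>, splitting the outer element.
print(re.search(r"<div>(.*?)</div>", nested).group(1))
# 'outer <div>inner'
```

A greedy quantifier fails in the mirror-image way (overrunning to the last closing tag), so no amount of quantifier tuning fixes it; tracking depth requires a parser.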



