Hacker News

Author here, happy to answer any questions

For our product (PixieBrix), we generally grab the data directly from the front-end framework (e.g., React props). It's a bit less stable, since that's effectively an internal API, but it means you can grab a lot of data with a single selector and can generally avoid parsing values out of text.
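To make the idea concrete, here is a minimal sketch of reading props off a DOM node that React rendered. It is not PixieBrix's actual code. It assumes React's internal convention of storing a fiber on each rendered node under a key prefixed with `__reactFiber$` (React 17+) or `__reactInternalInstance$` (React 16); those keys and the `memoizedProps`/`return` fields are undocumented internals and may change between releases. The names `FiberLike`, `getFiber`, and `findProp` are made up for illustration.

```typescript
// Minimal shape of the fiber fields this sketch relies on.
// Assumption: these are React internals, not a public API.
interface FiberLike {
  memoizedProps?: Record<string, unknown> | null;
  return?: FiberLike | null; // parent fiber
}

// Assumption: React attaches the fiber under a key with one of these
// prefixes followed by a random suffix.
const FIBER_PREFIXES = ["__reactFiber$", "__reactInternalInstance$"];

// Find the fiber attached to a node, if any.
function getFiber(node: object): FiberLike | null {
  const key = Object.keys(node).find((k) =>
    FIBER_PREFIXES.some((prefix) => k.startsWith(prefix))
  );
  return key ? ((node as Record<string, unknown>)[key] as FiberLike) : null;
}

// Walk up the fiber tree until some component's props contain `propName`.
function findProp(node: object, propName: string, maxDepth = 20): unknown {
  let fiber = getFiber(node);
  for (let depth = 0; fiber && depth < maxDepth; depth++) {
    const props = fiber.memoizedProps;
    // Text nodes can carry a plain string here, so check the type first.
    if (props && typeof props === "object" && propName in props) {
      return props[propName];
    }
    fiber = fiber.return ?? null;
  }
  return undefined;
}
```

On a real page this might be called as `findProp(document.querySelector(".product-row"), "product")`, where both the selector and the prop name are hypothetical. The result is the structured object the component received, with no text parsing needed.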



Beyond it being an internal API: unless the app is built in debug mode, don't you get minified, name-mangled, tree-shaken code and symbols?

I would assume this might change on recompiles or at least library updates, never mind internal code changes. Do you find that it works in practice?


For React and similar frameworks, the component names get minified. In practice, though, JS compilers/bundlers generally can't mangle property names because 1) alias analysis is hard, 2) property-name string logic is ubiquitous in JS, and 3) the data often flows from APIs, and mangling would make API maintenance hard. Google's Closure Compiler and other compilers that do rename properties are an issue.
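A small illustration of why property names tend to survive minification, assuming a typical bundler setup that renames functions and locals but leaves properties alone. The function names and the JSON shape here are hypothetical.

```typescript
// A minifier can freely rename `formatPrice` and the local `item`, but not
// the keys "price" and "currency": they arrive from the server as data.
function formatPrice(json: string): string {
  const item = JSON.parse(json) as { price: number; currency: string };
  return `${item.price.toFixed(2)} ${item.currency}`;
}

// Dynamic access: the key is only known at runtime, so statically renaming
// `.price` elsewhere in the bundle would silently break this lookup.
function pluck<T>(rows: Array<Record<string, T>>, key: string): T[] {
  return rows.map((row) => row[key]);
}
```

After minification the two functions may end up named something like `a` and `b`, but a call such as `pluck(rows, "price")` still has to find a property literally called `price`, which is why the objects passed around stay readable.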

As I mentioned in the post, dynamic CSS class names are also tricky, depending on how much gets mangled. We have some techniques in the pipeline for handling those better.


I see, so you see class and function names that make no sense, but they get called with nice clean JSON objects. Neat trick!


What is your experience with frameworks other than React? Which one is the least stable, i.e., hardest to scrape?


Great question. Our general approach is to look up the devtools browser extension for the framework and use it as a reference point for how to interface with the framework.

The most popular framework we haven't implemented support for yet is Angular. (AngularJS, the old version, is straightforward.) Anything built with an optimizing compiler, e.g., Google Closure Compiler, is difficult because the identifiers get mangled. I suspect Svelte might also be tricky, but we haven't tried that yet.
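As a rough sketch of what interfacing with a framework can start from, the snippet below guesses which framework rendered a node by looking for the markers that framework leaves on it. The markers are assumptions based on framework internals rather than public APIs: React's `__reactFiber$`/`__reactInternalInstance$` key prefixes, Vue 2's `__vue__` on component root elements, and Vue 3's `__vue_app__` on the mount container. The `Framework` type and `detectFramework` name are made up for illustration, and this is not a description of how PixieBrix does it.

```typescript
type Framework = "react" | "vue2" | "vue3" | "unknown";

// Assumption: internal key prefixes React uses on rendered DOM nodes.
const REACT_NODE_KEY_PREFIXES = ["__reactFiber$", "__reactInternalInstance$"];

// Guess the framework from properties attached to a node.
function detectFramework(node: object): Framework {
  const hasReactKey = Object.keys(node).some((k) =>
    REACT_NODE_KEY_PREFIXES.some((prefix) => k.startsWith(prefix))
  );
  if (hasReactKey) return "react";
  // Assumption: Vue 2 sets `__vue__` on a component's root element.
  if ("__vue__" in node) return "vue2";
  // Assumption: Vue 3 sets `__vue_app__` on the element the app is mounted to.
  if ("__vue_app__" in node) return "vue3";
  return "unknown";
}
```

A node that reports `"unknown"` is where the DOM-selector fallback comes in.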

At the end of the day, though, every framework has to write to the DOM and be accessible. So you can use selectors or, in the worst case, OCR/computer vision. (IIRC, FB actively inserts dummy elements to try to prevent structural scraping.)



