
> Option 2 - Hand it off to DevOps. The other option is to have data science produce prototypes that can be on Notebooks and then have a devops team whose job is to refactor those into an application that runs in production. This process makes things less fragile, but it is slow and very expensive.

I've never understood why this is so hard. Every time data science gives me a notebook, it feels like I have been handed a function that says `doFeature()` and should just have to put it behind an endpoint called /do_feature, but it always takes forever and I'm never even able to articulate why. It feels like I am clueless at reading code, but only at this one particular kind of code.
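
To show what I mean, the naive wrap really is a dozen lines, which is exactly why the slowness is so confusing. A sketch, with a hypothetical do_feature() standing in for whatever the notebook computes:

    # sketch of the naive wrap; do_feature() is a hypothetical stand-in
    # for the notebook's code
    from fastapi import FastAPI

    def do_feature():
        # in real life this quietly assumes a local CSV, a pinned pandas
        # version, plenty of RAM, and data shaped like last month's extract
        return {"score": 0.42}

    app = FastAPI()

    @app.get("/do_feature")
    def serve_feature():
        return do_feature()

Everything hard hides inside that stub: the data access, the environment, and the implicit assumptions about input shape.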



A data scientist wants results with minimum programming effort, and efficiency be damned. Pull all the data and join it all together in a honking great data frame, use brute force to analyse it.

This isn’t necessarily what you want in a daily production environment, let alone a real-time environment.
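
To make the contrast concrete, the exploratory pattern looks roughly like this (toy in-memory tables standing in for full warehouse reads; all names are made up):

    import pandas as pd

    # pull everything and join it into one honking great data frame
    users = pd.DataFrame({"user_id": [1, 2], "segment": ["a", "b"]})
    orders = pd.DataFrame({"user_id": [1, 1, 2], "order_total": [10.0, 20.0, 5.0]})
    df = users.merge(orders, on="user_id", how="left")

    # brute-force the analysis over the whole thing
    print(df.groupby("segment")["order_total"].mean())

Perfectly fine for a one-off analysis; a liability as an hourly job against growing tables.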


Thanks for the comment: your frustration is the default in the industry, and it's part of the reason why Bauplan was built.

"but it always takes forever and I'm never even able to articulate why." -> there are way more factors at play than DoFeatures unfortunately, see for example Table 1 (https://arxiv.org/pdf/2404.13682). Even knowing which data people have developed on is hard, which is why bauplan has git-for-data semantics built in: everyone works on production data, but safely and reliably, to avoid data skews.

Each computer is different, which is why Bauplan adopts FaaS with isolated, fully containerized functions: you are always running in the cloud, so there is no infrastructure skew either.
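
A sketch of the idea (the decorator below is a stand-in, not our actual SDK): each function declares its own pinned runtime as code, and the platform ships it to an isolated container:

    # stand-in decorator: records the environment a step needs; a real
    # platform would build a container from it and run the function there
    def cloud_function(python="3.11", pip=None):
        def wrap(fn):
            fn.runtime = {"python": python, "pip": pip or {}}
            return fn
        return wrap

    @cloud_function(python="3.11", pip={"pandas": "2.2.0"})
    def clean_orders(orders):
        # always runs with exactly pandas 2.2.0, regardless of
        # whose laptop the code was written on
        ...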

The problem of "going to production" is still the biggest issue in the industry, and solving it is not a one-fix kind of thing: it takes the combination of good ergonomics, new abstractions, and reliable infrastructure.


I'll do you one better. Productionizing a data science prototype is exactly the kind of grunt work AI is able to take over.

I think it's a much better result to have the data science prototype translated into a performant production version than to take a Databricks-type approach, or what Bauplan is proposing.


Maybe, but it would still need to work within a well-defined framework. Usually the data science part is “solve the problem”; the data engineering part is “make it work reliably, fast, at scale”.

What that looks like is highly dependent upon the environment at hand, and letting AI take that over may be one of those “now you have 2 problems” things.


We are not proposing or advocating for any approach to development (I personally almost never use notebooks these days and run Bauplan with preview).

The blog post, written together with our marimo friends, is meant to showcase that you can have notebook development (if you like it) AND cloud scaling (which you need) without code changes, thanks to the fact that both marimo and Bauplan are basically Python (maybe a small thing, but there is nothing else on the market remotely close).
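
To make the "basically Python" point concrete: a marimo notebook is a plain .py file, roughly like this (a minimal sketch, trimmed from what marimo actually generates):

    import marimo

    app = marimo.App()

    @app.cell
    def _():
        import pandas as pd
        df = pd.DataFrame({"x": [1, 2, 3]})
        return (df,)

    @app.cell
    def _(df):
        # cells declare their inputs as parameters, so the dependency
        # graph is explicit and the file runs anywhere Python runs
        total = df["x"].sum()
        return (total,)

    if __name__ == "__main__":
        app.run()

Because it is an ordinary module, the same file can be versioned, diffed, and shipped to the cloud without translation.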

On the AI part, we agree: the fact that Bauplan is just Python, including data management and infra-as-code, makes it trivial for AI to build pipelines in Bauplan, which cannot be said of other data platforms. If you follow our blog, in a few weeks or so we are releasing a full "agentic" implementation of production ETL workloads with the Bauplan API, which you may find interesting.



