I've been thinking a lot about how I manage my own data lately (notes, photos, code, reference material, etc.) and have concluded that the primary feature I'm looking for is longevity. I'm saddened by the amount of data I've lost over the years, either because of hard disk failures or third-party services going out of business, making it difficult to extract things, or getting too expensive.
In light of this, I'm biasing toward simple file formats managed by tools I write myself, and optimizing for cost in a way that I otherwise don't, since any recurring costs incurred by the system are effectively a lifelong commitment. I am relying on S3 for primary storage (so that it is accessible anywhere) but with a sync to offline backup.
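The sync-to-offline half is just a one-way mirror. A minimal stdlib sketch of the newer-file-wins logic that tools like `aws s3 sync` or rclone apply (this is an illustration, not the actual tool):

```python
import shutil
from pathlib import Path

def mirror(src: Path, dst: Path) -> list[str]:
    """One-way mirror: copy files from src that are missing from dst,
    or newer than dst's copy. Returns the relative paths copied."""
    copied = []
    for f in src.rglob("*"):
        if not f.is_file():
            continue
        target = dst / f.relative_to(src)
        if not target.exists() or f.stat().st_mtime > target.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy2 preserves mtime, so re-runs are no-ops
            copied.append(str(f.relative_to(src)))
    return copied
```

Because `copy2` preserves mtimes, running the mirror twice copies nothing the second time, which is the property that makes a nightly cron sync cheap.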
So far, I've implemented a personal Zettelkasten tool (with built-in spaced repetition, so doubles as an Anki replacement) and a search engine that's based on Presto (via AWS Athena) so that I don't need to keep an Elasticsearch instance alive. I'm planning to build out other repository tools as I go.
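For illustration, the search side reduces to running a SQL query against an external table that sits over the raw files. A sketch, where `notes_index` and its columns are hypothetical names (not the actual tool's schema) and the Athena call is left commented out:

```python
def search_notes(term: str) -> str:
    """Build a Presto/Athena full-text query over a hypothetical
    external table of note files. Table/column names are illustrative."""
    safe = term.replace("'", "''")  # naive quote-escaping, fine for a sketch
    return (
        "SELECT path, body FROM notes_index "
        f"WHERE lower(body) LIKE lower('%{safe}%')"
    )

# Hypothetical execution against Athena (requires boto3 and an AWS account):
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=search_notes("zettelkasten"),
#     ResultConfiguration={"OutputLocation": "s3://..."},
# )
```

The appeal is that nothing stays running between searches; you pay per query instead of per hour of an Elasticsearch node.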
It's been very liberating to build tools that are never meant to be used by anyone other than myself, and with the confidence that the tools don't matter too much anyway since the underlying files are stored in evergreen formats.
What's the optimal setup for long-term, large-scale (personal) data storage?
I want to build one big backup. Some initial research has pointed me to something like Bacula to manage the backup process from a machine. Following the 3-2-1 rule, I know the backup itself needs at least 3 copies, on at least 2 different media (cloud/hard disk), at least one of which is off-site.
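The 3-2-1 rule is mechanical enough to write down as a check. A toy sketch (the example plan is mine, not a recommendation):

```python
def satisfies_3_2_1(copies: list[dict]) -> bool:
    """Check a backup plan against 3-2-1: at least 3 copies,
    on at least 2 media, with at least 1 off-site."""
    return (
        len(copies) >= 3
        and len({c["medium"] for c in copies}) >= 2
        and any(c["offsite"] for c in copies)
    )

plan = [
    {"medium": "hdd", "offsite": False},    # primary copy
    {"medium": "hdd", "offsite": False},    # local backup disk
    {"medium": "cloud", "offsite": True},   # e.g. an object-storage bucket
]
```

Dropping the cloud copy from `plan` fails the check twice over: only two copies remain, and none is off-site.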
As an individual, what's the best way to implement such a system? Should I buy one giant hard drive, build a RAID array out of several drives, or something else?
Oooh. I've been wrestling with this problem for a while now.
Basically I'm working on a tiered system. Files/dirs are categorized by size (<10MB, <25GB, >25GB) and by sensitivity (public, confidential, secure; importance is usually proportional to sensitivity). Fortunately, I've found that sensitivity is usually inverse to size. Anything public that makes sense as a repo goes on GitHub/GitLab. Confidential small stuff (sans keys) is just stored in Gmail/Drive. Big, boring stuff (music, ebooks) is just kept on external hard drives.
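That routing is simple enough to sketch. The fall-through for mid-size confidential data is my assumption, not something stated above:

```python
MB = 1_000_000

def route(size_bytes: int, sensitivity: str) -> str:
    """Route a file to a storage tier by the size buckets and
    sensitivity levels described above."""
    if sensitivity == "public":
        return "github/gitlab"
    if sensitivity == "secure":
        return "unsolved"            # the hard case
    if size_bytes < 10 * MB:
        return "gmail/drive"         # small confidential stuff, sans keys
    return "external-hdd"            # big, boring stuff (assumed fall-through)
```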
Secure, ultra-important stuff, I don't really have a system for.
The system I'm leaning towards is to encrypt archives and store the key/password securely, then treat the encrypted archives like any other boring data: a local NAS plus a cloud backup service of some sort, or drives stored offsite.
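A minimal sketch of that encrypt-then-store-anywhere flow, assuming the third-party `cryptography` package (a real setup would stream large archives rather than read them into memory, and would likely use a tool like age or gpg instead):

```python
import tarfile
from pathlib import Path
from cryptography.fernet import Fernet  # third-party 'cryptography' package

def encrypt_archive(src_dir: Path, out_path: Path) -> bytes:
    """Tar a directory, encrypt the tarball, write it to out_path.
    Returns the key, which must be stored safely and separately."""
    tar_path = out_path.with_suffix(".tar")
    with tarfile.open(tar_path, "w") as tar:
        tar.add(src_dir, arcname=src_dir.name)
    key = Fernet.generate_key()
    out_path.write_bytes(Fernet(key).encrypt(tar_path.read_bytes()))
    tar_path.unlink()  # drop the plaintext tarball
    return key
```

The resulting `.enc` file is exactly the "boring data" the comment describes: it can sit on a NAS, a cloud bucket, or an offsite drive without further protection, as long as the key lives elsewhere.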
Do you feel comfortable using cloud storage for so much of your content? My ideal is to be entirely self-backed-up: a personal git server, photo archive, etc., given the bandwidth, service costs, and vendor issues involved (dealing with Google seems like a nightmare, from what I've read online).
How did you construct your NAS? Is it a single system, or multiple hard drives/storage solutions connected to your network?
It depends. GitHub is not going down. Gmail is not going down. If they do, it's bug-out-bag time, and I'm working on curating the subset of information I'd need for that.
Ideally, yes, I would have my own entire backup system, but I frankly don't trust myself enough to do it right; hence some redundancy in the cloud.
You mention S3 and Athena, but also that you're building for longevity. Are you planning for the future obsolescence of AWS, or going to cross that bridge when you get to it?
The S3 files are mirrored to a local drive as a collection of plain .md, .jpg, etc. The Athena search index is secondary in importance to the source data and not necessarily permanent (presumably the options for "take this folder full of files and let me search it" will only improve over time).
That being said, one of the reasons I chose S3 vs. other AWS services or other companies is because I expect it to be around for a very long time. (Just because I've preserved the option of migrating away doesn't mean I relish the idea.)