How do I safely store my files? (photostructure.com)
152 points by dsego on Jan 27, 2021 | 132 comments


Hollywood still archives productions on film. Specifically, color-separated film: three rolls of B/W PET-base film, one each for red, green, and blue. These are expected to last 100+ years, are trivial to read, and have color stability beyond any color film stock (because the color information is encoded in black-and-white film rather than in dyes), and they offer more than sufficient resolution (they aren't prints, but are made using laser film recorders). This sounds expensive but is apparently about one order of magnitude cheaper long term than digital "archival".

If companies with Hollywood funding consider digital archival too expensive and too fragile... that should really tell you something.

Other long-term archives are not digital, either. Germany's Barbarastollen houses thousands of stainless-steel barrels of black-and-white microfilm, holding roughly a billion images.


You'd think they would've learned their lesson? https://en.wikipedia.org/wiki/2008_Universal_Studios_fire


Archiving on physical media doesn't prevent you from keeping multiple copies.


>Hollywood still archives productions on film. [...] If companies with Hollywood funding consider digital archival too expensive and too fragile.

I'm ignorant of the movie industry. Are you also saying that movies that were 100% digital capture (e.g. RED CINEMA digital cameras) are also "printed" to film stock to be archived long term?


From this excerpt about Project Silica, it would seem that they currently use hard drives and data tapes for digital archiving: https://youtu.be/fzWbnXHEydU?t=124


I will take this opportunity to highly recommend UNRAID. You can use an old computer, add in some internal hard drives, and boot it off a USB stick.

Drives can be added or removed at any time, so the system grows as your data grows, and unlike traditional RAID you can mix and match drives of different types and sizes. Because it is not a real RAID implementation, you can pull any data drive out of your array and read the files residing on that drive in another system. Data is protected against drive failure by single or dual parity drives, and using SSDs for caching is even supported.
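
As a toy illustration of the single-parity idea (not Unraid's actual implementation), plain XOR parity is enough to rebuild any one lost data drive from the survivors plus the parity drive:

    # Toy sketch only: generic XOR parity, not Unraid's real code.
    # With one parity block, any single missing data block can be rebuilt
    # by XOR-ing the surviving data blocks with the parity block.
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]  # three data drives
    parity = xor_blocks(data)                          # what the parity drive stores
    rebuilt = xor_blocks([data[0], data[2], parity])   # pretend drive 1 just died
    assert rebuilt == data[1]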

For anyone who is mildly technically inclined and enjoys DIY solutions, I would recommend at least checking out UNRAID before purchasing a Synology NAS (or similar).


Author here! I actually hadn't even heard of Unraid until about a year ago, when some of my beta users were asking for help with their Docker setup, and when I asked what their host OS was, they said "UnRAID".

The community is supportive, templates are created and updated quickly, and there are several YouTubers who make great, easy-to-follow videos on the care and feeding of your server. +1.


Another one to look at is SnapRAID [1]. Open source, also supports a mix of disk sizes and adding/removing drives, and the drives are usable alone. Negatives are that it does scheduled (not real-time) parity, and you have to run mergerfs yourself on top if you want a consistent view.

I've been using it for a few years, and it's survived one drive failure (which I replaced with a disk twice the size). I originally had it on an Ubuntu server but have since migrated to running it in openmediavault (which does all the setup for you), on a VM within a Proxmox server.

[1] https://www.snapraid.it/


Author here: I actually added SnapRAID to the article a while back due to some of my beta testers singing its praises. I haven't used it myself, but I'm glad it's saved your data!


Personally I like a homebrew setup better: https://blog.haschek.at/2020/the-perfect-file-server.html

Because I don't need web interfaces or more attack surfaces


> If [a disk fails] you should replace the disk as soon as possible.

Yet this setup doesn't notify you of that happening, so you'll have to manually keep an eye out...


Agreed! I built an Unraid box last year and it’s been a great all around computing machine to have around the house. NAS + virtual machines + docker apps, really does it all.


The Unraid box I bought from Tom is sitting in my closet; it has been across the country 3 times now, is about 13 years old, has 14 drives in it, and still runs great. I love it.


The Unraid pricing page has this curious request in bold:

> Please do not use a comcast.net email address to purchase a license key.

Anyone know what the problem with comcast.net email addresses is? I know that in general it is not great to use addresses tied to your ISP for long-term things, but that applies to most ISPs, so I'm guessing the issue is something specific to Comcast.


As a side note: we are all adults. What is so hard about stating

> Please do not use a comcast.net email address to purchase a license key, they blackhole our email.

Super simple and short (assuming that explanation from a sibling comment is correct). People are much more likely to oblige/agree when the reason for a rule is stated.


It's been a while since I saw it, but I believe that Comcast blackholed all their emails at one point (might still do so).


Would you recommend it even over FreeNAS?


Yes, because of the drive flexibility. FreeNAS uses traditional RAID.


FreeNAS uses ZFS. There isn't anything traditional-RAID about it.

You can expand it but not by adding a single disk to a parity set (going from five drives to six), only by adding a set of disks (adding another set of five or a similar combination).


I'm well aware of ZFS and have used it many times. To noobs the world can be separated into "RAID" and "non-RAID" type setups. UNRAID (non-RAID) offers much more drive flexibility which makes it cheaper to start and add to over time compared to ZFS.


You obviously have no clue what FreeBSD/FreeNAS and ZFS are. I always recommend a battle-hardened filesystem/volume manager (ZFS; Lustre at Los Alamos) over some proprietary software.


I'm well aware of ZFS and have used it many times. To noobs the world can be separated into "RAID" and "non-RAID" type setups. UNRAID (non-RAID) offers much more drive flexibility which makes it cheaper to start and add to over time compared to ZFS.


> A printed photo in an album can easily last 20+ years without deterioration.

This first sentence is a big one. For a while now we have been making a yearly printed album of our favorite family photos. The first one was a gift for our parents but we liked it so much we continued the project.

When close friends and relatives visit, it is much more likely that we'll pull an album off the shelf and enjoy the memories together than we would digitally go through a year's worth of pictures.

It's a fun family project and you end up with a nice curated keepsake that will last a long time.

I also store a copy of each album sealed in a mylar bag, which should keep them nicely for any curious descendants long from now.


I love how I stumble over photo prints from time to time in my home. When I stumble over a hard disk drive or a USB stick I usually don't look at what's on it, because I'd have to boot up my PC, etc.


In case you're interested in how much traffic you get from a front-page post on some random Tuesday evening:

https://forum.photostructure.com/t/front-page-of-hacker-news...


Author here: fun to see this on the home page!

I had so many of my family, friends, and beta users ask me this question that I decided to do some research and write it up.

If you've got suggestions, or find any bit confusing, I'm all ears.


So why the recommendation for 4 or more drives? Aren't 2 drives enough in a mirror, or are you recommending minimum 2-drive-redundancy RAID-6 / SHR-2?

I'm getting close to upgrading my old 2-drive Synology at home that has been running non-stop for about 7 years without fail, and was just going to do another 2-drive unit with some new large-capacity drives (Synology 720+).


I have a single-drive Synology NAS and it works great (OK, the CPU is underpowered). I have the primary copy on my PC, a second copy on the NAS, and the NAS backs up to the cloud (AWS Glacier). Multi-drive NASes have the redundancy and speed, but I really don't need that, and the single drive is cheaper. Every few years I'll upgrade the PC drive and give my old HDD to my dad as a just-in-case backup.


Great question: it's for expandability.

If you run RAID-1 in a 2-slot NAS, and you run out of space, you've got two options:

1) buy another NAS

2) move off of RAID, and have 2 distinct volumes. You'll have twice the storage, but no spindle redundancy.

If you've got 4 or 5 slots, you can throw in 2 larger disks, and use the prior disks either for backup, or use the new disks for incremental additional storage. You've just got a ton more flexibility.


So you are just recommending a NAS with at least 4 drive bays, not necessarily 4 drives. I read this and interpreted it as a 4-drive minimum for data redundancy:

"Consider getting a NAS that has 4 or more drives to offer redundancy and support data integrity checks"

I may go for a 4-bay, but I never upgraded capacity in the last 7 years (3TB mirrored) and would probably go 14TB mirrored on the new one, so it seems like a waste.


> I never upgraded capacity in the last 7 years (3TB mirrored) and would probably go 14TB mirrored on the new one, so seems like a waste.

Wow, yeah, if your storage growth is that stable, and there's a big cost penalty for 4+ slots, then a 2 slot NAS would seem like a good idea.


Yeah, my movie collection isn't growing; that's all streaming now for new stuff. Photos are in iCloud, so they're on multiple devices, but I may add a Synology backup with the new one. Documents are nothing.

The main thing I'll use more capacity for will be longer retention of surveillance camera footage, plus more and higher-res cameras.


Ah crap that should say slots, not drives, my bad. Thanks for bringing this to my attention!


If you're on Synology, the Synology Moments iOS/Android app supports auto uploading photos to your NAS. The DS File app supports uploading them to Photo Station.


The JPEG bitrot interstitial got me thinking... I wonder how difficult it would be for an algorithm to correct that? The corrupted image doesn't look anything like a normal image, so, just try flipping bits and see what results in something more photographic. And maybe present multiple options to the user if the algorithm isn't confident.


That's really interesting: it's pretty apparent in most images when an 8x8 block is borked. It could just randomly flip bits to minimize entropy in both the hue and lightness channels.
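
A rough sketch of that idea, assuming Pillow and numpy are available and that you've already narrowed the corruption down to a suspect byte range (brute-forcing every bit of a whole file would be far too slow); the "entropy" score here is just total variation, a simple stand-in for blockiness:

    # Hypothetical repair-by-search sketch: flip each bit in a suspect byte
    # range and keep the candidate whose decode looks least "blocky".
    from io import BytesIO
    import numpy as np
    from PIL import Image

    def blockiness(jpeg_bytes):
        try:
            img = np.asarray(Image.open(BytesIO(jpeg_bytes)).convert("L"), dtype=float)
        except Exception:
            return float("inf")   # undecodable candidates lose automatically
        # total variation: garbled 8x8 blocks add lots of spurious edges
        return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()

    def best_single_bit_repair(data, start, end):
        best, best_score = data, blockiness(data)
        for i in range(start, end):
            for bit in range(8):
                candidate = bytearray(data)
                candidate[i] ^= 1 << bit
                score = blockiness(bytes(candidate))
                if score < best_score:
                    best, best_score = bytes(candidate), score
        return best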


If you have an EXIF thumbnail in the file, you could use it as a reference image. Although a codec block in the main image will probably be one pixel or less in the thumbnail.


You can try to flip bits to remove entropy, e.g. make the JPEG blocks less prominent. However, unless you happen to know how many bits to flip, or are lucky enough that only low-frequency information is ever destroyed, this operation is more akin to blurring than to magic information retrieval.


We may not know how many bit flips occurred, but we can probably assume it's in the low single digits, right? So, try flipping each bit, and see if any one of them results in a very large improvement. If there are no matches, try every combination of exactly two bits, then every combination of three, and so on.

Would that result in blurring? Perhaps the example shown in the article (where each individual bit flip resulted in major, obviously abnormal changes) was atypical?


> We may not know how many bit flips occurred, but we can probably assume it's in the low single digits, right? So, try flipping each bit, and see if any one of them results in a very large improvement. If there are no matches, try every combination of exactly two bits, then every combination of three, and so on.

There are almost 10²⁰ ways to flip 3 bits in 1 MB. This approach seems unfeasible if any more than 1 bit has been flipped.
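
For reference, that count is just a binomial coefficient:

    from math import comb
    print(comb(8 * 10**6, 3))   # ways to choose 3 of the ~8 million bits in 1 MB
    # ≈ 8.5e19, i.e. nearly 10**20, so brute force is hopeless beyond a flip or two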


Yeah, I guess I didn't do the math on that.

Even the ability to correct single bit flips would be useful though. I can also imagine much faster ways to do it... for instance, since the tops of the images are fine, the problem is likely in the 8x8 block where the corruption began (and the user could point to this).


Have we found a new, and actually useful proof of work algorithm?


There could be fewer than you think or the picture could be (locally) sharper than usual, and you'll just blur it.


There are neural nets that can "enhance" images and some recolorization software (for black and white footage from years ago) will actually improve resolution.


"it would take 120 BD-R disks, and $1,400 in media, using $12 M-Discs"

That may be right, but M-Disc-grade storage is far better in the long run than a simple HDD. Accidental deletion, failures, and ransomware are all much larger problems than many think when it comes to media other than read-only optical discs.

Every time optical discs are considered, people talk only about the organic dyes of CD-Rs. HTL Blu-rays don't have issues like this, and M-Disc Blu-rays are even more durable.


Tape storage could also be an option; I heard from a friend that those (both drives and media) are way cheaper these days.


Agreed, and you don't have to burn everything to BR discs, only the most important stuff.

I've recently bought a synology to store my photos. But I'm questioning my reasoning for this at the moment. The upside is that I have a central place to store photos so they aren't scattered around on different laptops and external drives.

But the downside is that it's all in one place and even with backups it's more vulnerable to a single point failure.

So maybe there is some merit to storing photos on smaller external drives and having a few cheap 500gb drives for each year (and make some copies of course). And then maybe pick the best photos and burn to some HTL BR discs. The biggest problem with external drives is incompatible filesystems and fat32/exfat being susceptible to corruption.

Because honestly I'm not going to take another photo in the year 2020, am I?

So instead of diligently backing up my NAS, securing it from hackers and viruses, hoping the btrfs devs wrote unit tests, etc. (1), I can make a few redundant copies on different hard drives and USB sticks and leave some at my parents' house etc.

Theoria Apophasis, a youtuber ("the angry photographer"), has a few rants about long term data storage and archiving. He is a bit eccentric but imho makes some good points. Btw, don't watch if you don't want to have nightmares about your hard disk dying any minute now, ignorance is bliss.

- Methodology to protect your data. Backups vs. Archives. https://discussions.apple.com/docs/DOC-6031

- DO NOT LOSE YOUR DATA! Please be wise about DATA STORAGE https://youtu.be/o99hwegrvJQ

- Data-God Photographer: Part 1: Backups, Archives & Redundancies, Hard Drives & Optical https://youtu.be/scLMP9gm--M

- Angry Photographer: Video 1. HOW Hard Drives WORK, why you MUST fear & hate them https://youtu.be/uKGsNoUZAO8

- The ONLY SOURCE for long-term Archival DATA PROTECTION is... https://youtu.be/qbxaPc2Xf5M

(1) Tbh, I also have no faith in Synology software; my user experience so far has been pretty disheartening. Photo Station & Moments are horrible: the directory is hard-coded, you can't choose where your photos are stored, which is just ridiculous. The backup software doesn't have an archive option for deleted files (like the CCC "Safety Net" or rclone backup); it can only delete or keep the files. Also, for example, the AFP protocol loses the file modified date: https://discussions.apple.com/thread/7547857

My understanding is that they're putting a shiny wrapper on some open source software and reselling it at a premium. Basically this: https://xkcd.com/2347/ I want to be that happy, enthusiastic person in their promotional material, but the experience is not reassuring. And supposedly everything else on the market is even worse (plenty of Drobo horror stories online).


> Photo station & moments are horrible

Agree, they are unusable. I still don't understand the different user permission settings for photos and everything else. Photos are somehow separate.

That said, I just don't use it and do all of that on external computers, using photo software.


Yes, and moments doesn't even have read-only access for viewing photos, which makes it a no-go for me.

https://community.synology.com/enu/forum/11/post/122691

And frankly this type of shenanigans makes me seriously doubt the competence of their software team.


And now I am trying the Suckology Cloud Sync with Google Drive, and after pausing it I'm getting "Unknown error occurs. Please try again later". It's a piece of junk. This particular English phrasing really inspires confidence.


This is missing a key point: when you keep multiple copies, don't forget to use different file systems on each of those, to avoid a case where one file system corrupts the data similarly across copies.


I don't think it's a good idea to depend mainly on low-level or device-specific features (in a NAS or your filesystem) to detect bitrot and corruption in files. As a user, you don't care WHY a file changed, only that it did so unintentionally.

The filesystem cannot truly know if a file being modified was intended or accidental. Sometimes a software error can wipe or corrupt a file, and a filesystem can't detect that, even if it can detect bitflips.

The data architecture I use combines, in order of importance: (1) borg, for incremental deduplicated backups onsite and offsite; (2) syncthing, for duplication and synchronization across devices; (3) fim, for managing file integrity; (4) rsync, local copy to another drive for convenient restores; and (5) git, in places where commit history is actually useful, like code or dotfiles.

fim combined with incremental backup solves this problem, is filesystem agnostic and painless to migrate around, and is for some reason completely unknown and obscure: https://evrignaud.github.io/fim/
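
For anyone curious what a tool like fim is doing conceptually, here's a minimal hash-manifest sketch (not fim's actual format; the paths, the JSON layout, and the SHA-256 choice are purely illustrative):

    import hashlib, json, os, sys

    def build_manifest(root):
        manifest = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                manifest[os.path.relpath(path, root)] = h.hexdigest()
        return manifest

    def diff(old, new):
        added = sorted(set(new) - set(old))
        removed = sorted(set(old) - set(new))
        changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
        return added, removed, changed

    # usage: python manifest.py ~/photos photos-manifest.json
    if __name__ == "__main__":
        root, manifest_path = sys.argv[1], sys.argv[2]
        new = build_manifest(root)
        if os.path.exists(manifest_path):
            with open(manifest_path) as f:
                print(diff(json.load(f), new))
        with open(manifest_path, "w") as f:
            json.dump(new, f, indent=1)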


>The data architecture I use combines, in order of importance: (1) borg, for incremental deduplicated backups onsite and offsite; (2) syncthing, for duplication and synchronization across devices; (3) fim, for managing file integrity; (4) rsync, local copy to another drive for convenient restores; and (5) git, in places where commit history is actually useful, like code or dotfiles.

That looks overly complicated

Why is there no single tool that can do all of those?

I think one tool could be much more efficient. Especially since they could reuse the databases. With one database of hashes of each file, it could find the changed files and then incrementally only backup those. No need to have one program to search changed files to copy them, and then again have one program to search changed files to hash them

I mostly use rsync. Occasionally I destroy my backups by calling it wrongly, like missing a trailing slash. That could not happen if the copying and file integrity checking was combined in one tool

>fim combined with incremental backup solves this problem, is filesystem agnostic and painless to migrate around, and is for some reason completely unknown and obscure: https://evrignaud.github.io/fim/

That looks very useful

But it stores the database as JSON? That is a bad format to store file names. And being written in Java, it is probably going to be slow (I have over a million files in ~, and multiple copies of it in the backups, so I care a lot about performance).

So many emojis in the commit log. Is that really necessary nowadays?


> That looks overly complicated

In practice, it's simple enough. I have a backup script that does all the heavy lifting for borg and rsync. Syncthing operates independently in the background, and I rarely log changes with git. The most tedious part of the process is logging new/changed/deleted files with fim.

> Why is there no single tool that can do all of those?

Good question, and I totally agree. I was actually looking for a solution that could do deduplicated incremental backups (versioning, like time machine) plus file integrity management. I couldn't find one. Nobody seems to care much about file integrity, and nowadays I'm actually unsure if this is a real problem for anyone except paranoid geeks.

Could also use a good frontend GUI that could notify you about changed files, allow you to easily mark directories for different levels of "tracking", easily browse past versions and restore them, etc. The CLI isn't doing any favors here.

> I mostly use rsync. Occasionally I destroy my backups by calling it wrongly, like missing a trailing slash. That could not happen if the copying and file integrity checking was combined in one tool

Try borg-backup, it'll be a massive improvement over plain rsync when it comes to backups.

> But it stores the database as json? That is a bad format to store file names. And written in Java it is probably going to be slow (I have over a million files in ~, and multiple copies of it in the backups, so I care a lot about performance)

Gzipped json on-disk, but I haven't noticed any problems with it over the last few years. I doubt java is bottlenecking the performance at all: it's limited by IO and CPU throughput for hashing the files. For everyday use, you don't need to hash every file in entirety to detect changes. There's a "fast" mode that checks a couple of blocks. You can also operate in a subdirectory of the repo to ignore files outside that subdirectory.

I also don't just use one massive fim repository. I have a .fim/ for each directory I care about (types of data) that a secondary scripts loops through for checking the status. That weakens the integrity guarantees since you can't log the movement of files across directories, but makes it easier to reorganize things.

> So many emojis in the commit log. Is that really necessary nowadays?

Yah, that's the most emojis I've ever seen in a git repo. Maybe 2017 was a different era.


Fim is interesting, but I think most people (including myself here) don't worry about this because we rely on snapshots of backups. If a file was updated incorrectly then you just go back to the last known snapshot that worked.

Of course this is only as good as the interval you use for your snapshots. Fim seems like it would be a lot of extra work to do for every file/folder, but I could potentially see using Fim for important files (e.g. tax records, password vault?).


Fim works best for the types of files that don't change frequently, or shouldn't change at all. I use it for photos, digital art files, videos, music, books, records/receipts, etc. When I started, I put everything into fim, but there was too much noise to be useful.

Files that change frequently (password database) or won't totally bork out if there's a bit flip (text notes and org files) don't really need their integrity managed manually.


fim looks interesting. I skimmed through the linked page but couldn’t find an answer for a question I have — can fim be used for data stored elsewhere, by which I mean that the fim repository should be located outside the directory structure that it’s tracking? If I have a Photos directory with files and subdirectories, I don’t want that structure to have anything else (including fim).


There isn't a command line argument for pointing to a .fim/ located elsewhere, if that's what you mean. It probably wouldn't be hard to add that to the utility.

What you could do is put your photos subdirectory inside another directory, something like below.

    data/
    ├── .fim
    │   ├── settings.json
    │   └── states
    │       └── state_1.json.gz
    ├── photos
    └── records


I spent a long time devising a solid photo archive strategy.

For "current year" photos i store them on my NAS, backed up nightly, locally and remote.

Every year I then archive the previous year's photos onto an identical pair of M-Disc BDXL discs (100GB), and store one copy locally and one copy at a "temperature controlled" remote place. I use no compression, encryption or archiving. If there is degradation of the physical media I want to minimize the damage.

Along with the archive BDXL media, I store an external drive which contains a complete copy of all photos. I also update this drive every year, and run a non-destructive badblocks test on it as well as a long SMART test before updating it. The drive is then updated and rotated with the remote one (typically when storing new M-Disc media), and I repeat the process for the retrieved drive.

Again, a plain ext4 filesystem (I HOPE FAT32 will finally be dead in a couple of decades), no compression, archiving or encryption.

I never delete the photos from my NAS. The archive is simply "disaster recovery". I only archive photos. Any document will probably have little value in a decade or two.

I'm also fully prepared to migrate my entire archive onto whatever the "next big thing" in archiving becomes. No archives are truly forever. Optical media can go away, and USB 2/3 will certainly become obsolete at some point. There's no point in having an archive I cannot access.

And yes, I have considered just creating physical copies of the images and storing them in identical photo albums, but sadly nobody "develops film" anymore, and everything is printed, which also degrades over time, and unlike film of old there are no negatives to reprint images from.


My family lost all our printed family albums a few years ago. We still can't get over it. Unfortunately my parents didn't understand the importance of keeping the negatives around. Printed photos degrade over time, you can see the colors shift. But more importantly, we didn't have multiple prints and all the albums were in one drawer in a bedroom. My brother decided to burn all the photos in one of his fits of rage. But it could've happened in other ways, like the room burning down, an earthquake, etc.

Btw, do you think there is a difference between 25GB single layer and 100GB M-Discs regarding the media robustness and safety of data?


> Btw, do you think there is a difference between 25GB single layer and 100GB M-Discs regarding the media robustness and safety of data?

As I understand it, the 100GB discs add more data layers, and assuming they work like all other multi-layer optical media, this is achieved by focusing the laser at a different depth, so logically a scratch on the physical media can cause read errors on more than one layer.

Apart from physical damage, the media density is higher.

I would assume that the 25GB discs have a longer lifetime than the 100GB ones, but I have no illusions that either of them will last a millennium. If the media is still readable 1000 years from now, let's hope they still make optical drives to read them :-)

I'll be happy if they last 10-20 years, and considering some of the early burned CDs are still readable despite using dye (which degrades), I'd say there's a good chance that either size will survive just fine.

I store mine in jewel cases, in a dark closet at room temperature. The worst enemies are temperature, light and humidity.


It would be great to see a breakdown and guidance on integrity checking of files through BTRFS/ZFS etc.

I have recently started using BTRFS snapshots on my playground server, and my god is this awesome. I get that ZFS is quite special, but why is BTRFS not the default FS?


> It would be great to see a breakdown and guidance to integrity checking of files through BTRFS/ZFS/ etc.

Like, if you try to cause bitrot manually, and see how each recovers?

There's this: https://askubuntu.com/questions/406463/how-can-i-flip-a-sing...
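
A sketch of that "flip a bit underneath the filesystem" experiment, assuming a throwaway pool created on a loopback image file (the path is a placeholder; unmount/export the pool first, then remount and scrub to see what gets reported):

    import os, random

    IMG = "/tmp/testpool.img"   # backing file of a disposable test pool, NOT a real disk

    size = os.path.getsize(IMG)
    # stay away from the start/end where labels and superblock copies tend to live
    offset = random.randrange(1 << 20, size - (1 << 20))
    with open(IMG, "r+b") as f:
        f.seek(offset)
        byte = f.read(1)[0]
        f.seek(offset)
        f.write(bytes([byte ^ (1 << random.randrange(8))]))
    # a flip may land in free space and do nothing; repeat a few times if needed
    print("flipped one bit at byte offset", offset)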

Incidentally: I thought ZFS was ancient, and btrfs was a fairly new project, but I was wrong:

- ZFS was available in 2004: https://en.wikipedia.org/wiki/ZFS#Sun_Microsystems_(to_2010)

- btrfs was available in mainline linux in 2009: https://en.wikipedia.org/wiki/Btrfs#History

> why is BTRFS not default FS

Perhaps because there are still some sharp edges around btrfs, particularly around RAID56 support. See https://btrfs.wiki.kernel.org/index.php/Status for details.


BTRFS blew up under my feet in July 2019 (Arch Linux, up to date at the time). Sure, I powered down the machine with its reset switch, but it borked the FS; it refused to mount RW and some of my files were inaccessible. Even though it had been some ~3 years with no issues whatsoever, btrfs still isn't up to par with ZFS's stability.

Good thing I had syncthing running.


My experience as well! I've tried btrfs as a root filesystem (because I wanted a checksummed root and btrfs is easily available out of the box) a few times over the years, the last time probably 1-2 years ago, and invariably it somehow f*cked itself up sooner or later.

In contrast, I never had a single non-hardware-related problem with ZFS in maybe a decade of using it for NAS. It's totally rock solid! I've recently figured out how to do an unattended root-on-ZFS setup to provision my PCs, and I'm never going back to btrfs.


Yikes, now you have me worried. I've recently set up btrfs on a Synology, mostly to get the advanced features like scrubbing and snapshots. But if it's too risky maybe I should've gone with ext4.


I wonder how hard it would be to do a survey to turn anecdotes into data, rather than trying to do some sort of test to simulate corruption.

Like, I've been using ZFS for about 10 years without a problem -- I had a disk in a mirrored setup go bad that I didn't notice for a while, and when I replaced and rebuilt, I had a couple of files which had suffered corruption.

But that's just my personal anecdote; I'm quite confident there are people using BTRFS who can quote similar track records, and I assume there are people who have seen ZFS shit the bed for one reason or another.


Meh, it's only risky if you have all your eggs in one basket. Just put multiple copies of your data everywhere, and don't think too much about it.

Also, maybe Synology will manage its drives better than I did in my multiple-purpose desktop machine.


Someone gave me a reply on a different HN post saying that Synology's btrfs is fine, since it's using a different RAID implementation? Hope I understood that right.


I don't know if it ever got better, but in 2015 I lost multiple BTRFS file systems that we operated near the capacity limit, with a hefty churn of files. The FS wasn't shut down without unmounting, or anything like that, just used intensively. We had heard that you should keep plenty of unused space on BTRFS file systems, and this must be why.

The file systems couldn't be restored with the official suite of tools. It was a backup anyway and we restored the data from secondary backup.

I don't think there is any other file system that I've ever lost significant data on.


The only reliable method to store data long term is to start a cult that is dedicated to ritualistically copying your data every year and confirming that it is still intact. No other method will last.


As much as I don't like using Google services, they do offer something I might actually use, as per the mention of 'offsite' offprem storage in the article:

https://cloud.google.com/storage/archival/

There is also Amazon Glacier

https://aws.amazon.com/glacier/

They market these services towards SMEs, but one-man-shows/solopreneurs can use them too.


I use extra parity added with par2. (1) It proves no bit rot has occurred. (2) It can fix anything from single-bit errors up to completely recovering missing photo files. (3) Multiple implementations exist.


Is it a manual process to create the parity files? Right now I've set up a synology with mirroring and btrfs scrubbing. I'm wondering what more I can do to protect my photo library.


I'm not affiliated but have a look at https://github.com/brenthuisman/par2deep


A good cheap solution for me: par2 of your dataset, Glacier Deep Archive, with S3 bucket versioning. If you want to be even safer, turn on MFA delete on the S3 bucket.


I used to back up for long-term storage on floppies. Then on Iomega drives, later on CDs, and then on DVDs. My thinking was that what's on my hard disk today is volatile and the backup media is for the long haul.

In every case those became a problem over time either due to degradation of the media or simply because the underlying tech became more rare or unavailable.

So about ten years ago I flipped my approach. My long term storage is my current file server and for the backups I have no expectation of them lasting more than a year or two but that's fine.

The file server is up and running 24x7 so it is not bit-rotting in some drawer. It runs ZFS so integrity is guaranteed. ECC RAM. 4-way mirror for every pool so there's plenty of redundancy. The only drawback is having to stock it with enough storage to hold everything, but that's the tradeoff. Storage gets cheaper while backup media doesn't get particularly more reliable so it's a good tradeoff.

I still do backups of course (while ZFS and ECC guarantee integrity, nothing guarantees I won't accidentally rm a file) with zfs snapshot and send, but these are only for disaster recovery, not archival, so I don't need to lose sleep worrying how long they last.


The most important part of long-term storage/backup for photos, I have found, is: use write-only storage. Do NOT use photo backups that rotate out old versions. Mirrors are not backups. If deleting or corrupting a file means it's eventually deleted or corrupted in all backup copies as well, then it's not a good strategy.

Pick an off site backup system where you have enough space to store every version of every file, with infinite retention.


Eh, I disagree. I want to be able to eventually actually delete deleted files (to save on resources) more than I want to be able to retrieve files that I deleted a year back and didn't realize until now.


That can be handled by explicitly deleting it in the backup data or manually shifting out some old generation of backups.

What you want to avoid is the single most common error: you accidentally delete or corrupt something, then several years later you notice.

This might sound unlikely, but I assume it is more likely than theft, fire, and the other reasons we keep our backups off site.

Whether "retain all copies of everything, forever" is actually viable can vary with budget and the size of ones library of course. But storage is pretty cheap these days.


Hmm, I don't actually know of a way to delete a specific file from all backups in restic/borg, and it wouldn't be very convenient to do that for every single file I delete just to free storage.

Instead, I just set the retention policy to a year. Since most of my backups are photos and archival stuff like that, I rarely delete those without confirmation, so it's not a problem.

For other things like documents, I have Nextcloud, which does keep the last N versions, so it guards against the scenario you describe, true.


I moved a whole directory of photos once (a year basically) which accidentally landed it outside the tree that was backed up. This looked to the backup as if it was deleted and it was eventually shifted away. Since it was an old directory I never really noticed until it was too late. That’s why I now just pay to keep everything retained. I fit every photo I ever took with ease in a pretty cheap 2TB space.


I am afraid of this as well. It's easy to make mistakes and current software doesn't really offer assurances. Right now I sometimes visually compare source & destination directories. Like literally, make the windows the same size and scroll through them so I can notice if anything has visually shifted, which tells me files are missing. That's how I found some missing RAW files, but it turns out those were taken on the Fuji's auto settings, which doesn't store RAW even though I have it set up to take JPEG+RAW (many such idiosyncrasies with these cameras). I then almost had a heart attack when I realized that my camera filenames rolled over from 9999 to 0001 and I wasn't paying attention. Luckily there was no overlap in that directory, but I could've easily clicked confirm to replace my earlier photos with new ones (DCIM had better be replaced with something better soon).


Yeah, in that case it makes sense. I manage them using Capture One Pro, which only manages them inside their directory, so it's really unlikely that they'll get imported or moved out.

That said, I too have a few TB of online storage, but my problem is the reverse: I will sometimes capture a few GB of photos, import them for safety, the nightly backup will run and back them all up, and the next day I'll delete most of them as rejects. Those rejects I don't really care to keep, so I wouldn't want blurry or otherwise useless photos taking up space forever.


I use git-annex [https://git-annex.branchable.com/] for archiving, adding as many so-called remotes as I wish, automatically keeping track of what's stored where and for checking data integrity.

Support for borg remotes was recently added, which I think will be very useful.


My albums have survived disk crashes.

I try to keep it very simple. I use cron to rsync --backup to another computer and a NAS.

I do not use RAID. I want to be able to take any disk and mount it on any computer, possibly using a USB adaptor.

There is no need for instant synchronization. When I upload pictures to the photo album, I wait a couple of days before deleting them from the camera.

I set up a screensaver on a desktop to display random pictures from the NAS backup. We like to see the pictures that way, but it also has the advantage that we notice when the NAS is not working, and I do check that I see a recent picture once in a while.

I now make certain that I use photo album software that organizes pictures using directories and filenames. I once had a crash and recovered the actual photos but not the database with tags, names, etc. So I had to name all the pictures again.


> I want to be able to take any disk and mount it on any computer, possible using a USB adaptor.

That can be a problem if you want to use different operating systems and don't want to use FAT32/exFAT.


Good point about bitrot. This is why I wrote ccheck.pl (https://github.com/jwr/ccheck) — I wanted to be able to check and detect bitrot in a way that depends on as little technology as possible.


What happens if the checksum file created by ccheck experiences bitrot?


It's GPG-signed.


I have an 8TB hard drive in my desktop and an 8TB hard drive in a low power server. Btrfs to presumably detect bitrot at the filesystem level. They run syncthing to keep in sync. Both machines have external hard disk drive bays, and every so often I'll manually copy all my files to a 3rd disk as a cold backup.

It seems like a good enough solution for me to not worry or think too hard about it. Syncthing isn't the most user-friendly software though. When I upgraded from 2TB to 8TB drives, I accidentally botched a setting and ended up with about 1TB of duplicated "sync conflict" files. If I weren't a salty software engineer comfortable writing arbitrarily complex scripts with free rein to delete files, it would have taken more than a weekend afternoon to clean that up...


When you say you "copy all my files to a 3rd disk as a cold backup", do you mean you are overwriting any existing files on the 3rd disk? If so, I would recommend considering using something like Duplicacy (https://github.com/gilbertchen/duplicacy) to perform your backups with snapshots.

The reason being that you can decide how long to keep older copies of files, just in case something changes in a file but you don't happen to realize the incorrect update for a couple of months and want to go back to an earlier version.

To give you an idea of what is possible, here is the snapshot configuration I use:

- Keep no snapshots older than 450 days
- Keep 1 snapshot every 90 day(s) if older than 180 day(s)
- Keep 1 snapshot every 30 day(s) if older than 30 day(s)
- Keep 1 snapshot every 1 day(s) if older than 1 day(s)


I wipe the 3rd disk and then dumbly copy all my data to it again. My reasoning is that I want the complete volume of data actually copied as part of the backup activity. The source filesystem, being btrfs, should raise an error if there was bit rot, but that requires actually reading the file contents. If there is, I can grab the copy from the syncthing peer. The destination hard drive has to actually write all the file contents, so the magnetic fields should be all fresh.

The 3rd hard disk is a cold backup in case syncthing has a common mode failure that takes out my redundant drives. I want its filesystem to be as boring as possible. For me that currently means ext4. I don't want to have to manage and worry about another tool that optimizes away data transfers, when there's value in the data transfer and plenty of copy bandwidth going SATA to SATA.


Your current approach sounds fine from the perspective of bitrot. The hole in your approach right now is protecting from incorrect updates made either by yourself or by an application to your data that you don't notice immediately.

For example, say you are using a financial application like GnuCash on your desktop and it has a bug that gets triggered, causing a bad write to the file (power goes out, OOM, whatever). syncthing will happily propagate that changed file containing the bad write to your server, and when you copy to the 3rd disk you will also propagate that changed file. You deleted what was on the third disk, so now you no longer have a "good copy" of the file.

If you add some sort of snapshotting into your backup routine, then you would be protected from this, because even though you would still propagate the change to the most recent backups, you could still go back to an older snapshot (maybe a week or a month ago or whatever) and pull back a working version of the file.


That certainly makes sense. In my case, I don't run any applications like that directly on my storage filesystem by default. Those sorts of files and directories, including my linux home directory, are on a non-redundant SSD. If there's a failure, I accept the risk of losing that data. 90% of the time the data either doesn't matter, or is redundantly stored somewhere else, such as a git repo. A typical use case is digital photos; I may dump them into my desktop directory at first so that I may view and sort them, and they'll live there for an arbitrary length of time. I simply won't delete them from their SD card until they're copied into the storage filesystem.

Also, I have enough disks laying around that I usually have a 2nd cold backup to wipe. So, I have some inadvertent temporal redundancy going back a year or two.

When I make a conscious decision to copy files over to my storage, I have peace of mind that I'll get physical redundancy within hours, and the files will get picked up by my cold backup eventually.


What low-power server do you mean here? I use a similar setup, but I'm on Windows with a folder-to-folder sync tool to sync to an external HDD which is connected through USB. I do have a 3rd HDD to make occasional cold standalone backups.


It's an Intel Bay Trail nano-ITX motherboard, with what I think is called a pico PSU, and an SSD for the operating system. It runs off a 12V 2A power brick. No fans or anything. The only component that isn't solid state is the hard drive.

I used various arm single board computers previously, with the hard drive connected via USB. When I started using syncthing, though, it was literally taking days to hash through all the files, only going 1-2 MB/sec. The UI wouldn't work until hours after restarting the daemon because it took so long just to build its data structures in RAM. The baytrail CPU has enough oomph to tear through the hashing, and it's just nice to have expandable RAM and SATA.


> safely

You can't. But you can make multiple backups, each one reduces the chance of losing your files. Store at least one backup off site.

Also, media dies over time. Buy new hard disks annually and make them the new backups.


BTW, last fall I had a disk drive in a bubblewrap envelope on the edge of the counter. Accidentally knocked it onto the floor. Drive was dead after that.


I try to buy my HDDs in bigger stores that get their wares on pallets from trucks. They get less mistreatment that way than in a small box from a web store.


I put 45GB of newly-ripped CD MP3s (192k) on a close-to-new 80GB HD (Seagate) and put it in a closet shoebox. Twelve years later (2017) I put the drive into an external USB enclosure. It spun up and transferred all of the data with (so far as I've heard) nothing lost. (No doubt some bits were ... but no files.)

The old drive itself failed within a year.

So I'd feel pretty safe copying one drive to another every 5 years.

When it comes to digital images (I don't have enough to bother), archive-grade printing to paper seems like the obvious first-choice option -- IF it's done properly.


The yearly price to store 45 GB in Backblaze B2 is $2.10. After 12 years, your cost of storage-at-rest would total to $25.20.

Then, it would cost you $0.44 to download all 45 GB on the same day.

For $25.64 total (over 12 years), they store your data with significantly more redundancy than you get when you put one copy of your dataset on one hard drive.
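
Spelling that arithmetic out, with the rates as assumed above (circa-2021 B2 pricing: $0.005/GB-month after the first free 10 GB, $0.01/GB egress after the free 1 GB/day):

    gb = 45
    storage_per_year = (gb - 10) * 0.005 * 12         # $2.10
    storage_12_years = storage_per_year * 12          # $25.20
    one_full_download = (gb - 1) * 0.01               # $0.44
    print(storage_12_years + one_full_download)       # $25.64 total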

I chose B2 in this example because they're cheaper than blob storage from the main clouds, you're unlikely to get banned for an unrelated reason (cf. Google), and their pricing model is simple to understand.

Assuming ~120 MiB per compressed CD album, and ignoring the futzing about SI and Binary units, you can store at least 8 CDs worth of ~192 kbps music (~1 GB) in B2 for $0.005/month, and since the first 10 GB/month is free, your first 80 albums are stored for free. Then, each additional group of 8 albums is another $0.06/year.

If you're still unconvinced, and prefer the particular characteristics of control, convenience, and no direct monetary opex costs that personal self-managed storage affords, then consider that for an extra $25.64 over 12 years ($2.10 for storage-at-rest, yearly), you can have another copy of your 45 GB in the cloud, which significantly reduces the likelihood that your dataset is damaged.

You can even think of it as insurance, but with the extremely desirable property that you get your actual data back, and not just some other kind of compensation.


I generate photos and videos for one of my businesses and was backing up my laptop and external drives (connected regularly) to Backblaze. Or so I thought. When my critical 4TB external failed, I discovered that Backblaze hadn't had my system backed up for almost a year - there was a multi-TB backlog and at the throttled speed I had set for the sake of the office network, it would take 10+ years to catch up. Just something for everyone to stay aware of with cloud storage.

I currently back up to two drives at ingest, one being a per-year drive (e.g., 2020 jobs source media) that is rarely accessed. For major projects, I factor in one or two 1-2TB drives that receive an extra copy of source media and then are stashed. Not perfect, but so far so good.


Backblaze B2 is cloud storage, not backup afaik.


It's object storage, you can use it for either. Especially with adding other tools on top like Duplicacy or Rclone.


Kind of related, but I've had an idea for a while to make a user-mode file system on Windows that uses a single backing file (maybe SQLite) in order to speed up builds, since NTFS is very slow for large numbers of small files. It would also be nice to have integrity checking baked in. It amazes me how bad file systems are in 2021. No integrity checks, no transactions, slow/incorrect file watching, etc.

I know these features are provided by some newer file systems like zfs and btrfs, but those are not used by most consumers and are mostly used on servers.


I think part of it is the trade-off between durability and simplicity. 9 out of 10 users aren't going to be capable of configuring zfs or btrfs in a way that's actually more durable than FAT, NTFS or ext. When corruption inevitably occurs, the average person is much more likely to recover some of their data off of simpler filesystems than complex ones. "We saved 80% of your family photos by running FatFileFinderPro, that will be $250," is better to a consumer than, "Bummer, our highly advanced diagnostic tools say the drive is toast. That will be $250."

OEMs could choose to solve this for operating systems and maybe they do on internal storage, but it doesn't help with external storage, which is once again what 9 out of 10 people consider to be a "hard drive".

Hard drive manufacturers therefore have a strong incentive to solve durability as best they can inside the drive units themselves. I don't actually worry about bitrot too much because I assume there's some sort of proprietary error correction mechanism at the firmware level. I expect my data is either going to read out flawless, have gaping holes in it, or the whole disk unit will fail. I'm not going to blindly trust that such a mechanism exists, but it weighs into my risk analysis when making decisions on how to best protect my personal data.


Back in the day, one way to protect against bitrot specifically was the .par2 format: you maintain a bit of parity at the file level, and it can actively correct up to that much bitrot in a backup file.


I've seen a nice tool, par2deep, that I've got on my list to check out: https://github.com/brenthuisman/par2deep


Don't treat offsite as extra.

Don't treat retention as free. It's either your labour and disks, or paying for reliable files under an SLA, but it's $ no matter what.

Be philosophical when data loss happens.


I've only quickly perused the article so far, but it immediately reminded me of the following excellent article on ArsTechnica from 2014:

https://arstechnica.com/information-technology/2014/01/bitro...


Is there any common image file format with internal error correction embedded within it? Like a JPEG segment with LDPC data, for example.


A quote from the article: "Network-attached storage (NAS) devices hold several large hard drives and quietly do their work safely storing your files. You can keep using your favorite OS, but you don’t have to worry about bit rot anymore."

How exactly does a NAS protect data from bit rot? Data scrubbing?


Yes, that's why it is important to make sure you have scrubbing set up on a regular recurring interval.


https://perfectmediaserver.com/ mentions mergerfs, which may also be of interest if you're going down the server path.


Does anyone know what kind of advantages I get running "TrueNAS core" vs just running zfs off debian?

I assume a nice GUI to tune parameters?


FreeNAS was BSD-based (but I've been too chicken to migrate to TrueNAS after reading stuff in the subreddit).

The GUI _is_ pretty slick.


I'm just running zfs out of the `apt` box at the moment. I'm assuming if I get to the point that I care about tuning it, I'll be able to. We'll see if that assumption bites me in the ass.


Using ZFS on Linux (CentOS) proves to be on par with FreeBSD. They are using the same OpenZFS code AFAIK.


The FreeNAS/TrueNAS GUI is quite nice. It's a whole system designed for file storage. It uses ZFS, which is probably the safest file system for your data. I highly recommend it.


What about cloud storage like Dropbox?


Ultimately what's interesting is that Dropbox has to deal with these same problems as well. Dropbox also stores your data on physical media that's susceptible to bitrot. By using cloud storage you're just trusting that Dropbox, Amazon, Google, whoever are savvy enough to handle it. Do you trust them?


I store all binary data I want to keep around with git-annex these days. It solves integrity checking (you have a hash for every file) and distributed storage. Feels like RAID for layer 7.


This is too much for almost all users. IMO the best method is to simply store them on Google and keep a single local copy. It's almost impossible that Google will lose your data, since they keep multiple backups, and the local copy will save you if they ever kick you off the service.


Unfortunately, having directly felt the brunt of Google AI deciding it needed to disable something about my account for no appreciable reason, this is simply not true.

Even if Google had reasonable customer support, relying only on a cloud backup is unwise.

UX being what it is, my aunt just lost most of her iCloud photos because she thought she was cleaning up her laptop to donate, but iCloud synchronized all the deletes up to her account.

My father "clicked the wrong button" and lost his entire Picasa library a while back.

More copies are better.


If your aunt deleted her pictures recently enough, she might be able to get them back: they're stored in a Recently Deleted folder for 30 days before they're really gone.

https://www.imyfone.com/ios-data-recovery/how-to-recover-pho...


Thanks!


The person you are replying to said to add a Google Drive to existing local storage. Sounds like you agree.


I don’t think so... the one local storage mentioned originally would not qualify as a backup.


It's a backup of what's on your Google account. You set your phone to upload to Google, and once a month or so you download the backup.


Google can shut your account for no reason and with no warning.

Pricing plans and other terms of service can (and will) abruptly change.

There is no guarantee that Google will even exist 20 years from now.

If we are talking about data that is to be handed down for generations, I certainly wouldn't trust Google as a solution. I do agree though that setting up self-hosted local/offsite backup, the kind described in the article, is out of reach for most users. Maybe redundant copies on every cloud storage provider will do the trick?


I have already covered that case though. Google provides a tool to dump your data locally, which they can never touch. Put all your photos on Google first and then do a monthly backup to your local computer. It's verging on impossible that Google will ban you and your HDD will fail at the same time.



