Git-Annex
git-annex can be a bit hard to grasp, so I suggest creating a throw-away repository, following the walkthrough[1], and trying things out. See also workflows[2].
I wonder a bit whether that is ZFS, or git-annex, or maybe my disk, or something else.
Docs and other files you change often are a completely different story. This is where a DVFS shines. I wrote my own very simple DVFS exactly for that case. You just create a directory, init the repo manager... and voilà. A disk-wide VFS is kinda useless, as most of your data there just sits.
I wrote a bit about why in the readme (see archiving vs backup). In my opinion, syncing, snapshots, and backup tools like restic are great but fundamentally solve a different problem from what I want out of an archive tool like aegis, git-annex, or boar[2].
I want my backups to be automatic and transparent, and for that restic is a great tool. But for my photos, my important documents, and other immutable data, I want to manually accept or reject any change that happens to them, since I might not always notice when something changes. For example, if I fat-finger an rm, or a bug in a program overwrites something and I don't notice.
So, what solution would be better for that? In the end it seems that other solutions provide a similar set of features. E.g. Syncthing.
But what's the downside with Git-annex over Syncthing or other solutions?
As I said, handling immutable data (incrementally) is easy. You just copy and sync. Kinda trivial. The problem I had personally was all the important docs (and similar files) I work on. First, I wanted snapshots and history, in case of some mistake or failure. Data checksumming, because they are important. Also, full peer-to-peer syncing, because I have a desktop, servers, VMs, and a laptop, so I want to sync data around. And because I really like Git, a great tool for VCS, I wanted something similar but for generic binary data. Hence my interest in a DVFS system. First I wanted a full-blown mountable DVFS, but that is complicated and much harder to make portable. The repository approach is easy to implement and is portable (Cygwin, Linux, UNIX, POSIX). Works like a charm.
As for downsides: if you think git-annex will work for you, just use it :) For me, it was far too complicated (too many moving parts) even for my DVFS use case. For immutable data it is absolute overkill to keep 100s of GBs of data there. I just sync :)
The problem is performance in some use cases, but I don't see anything fundamentally wrong with using git for sync.
Why the heck are you using (D)VFS on your immutable data?
Git-annex does not put your data in Git. What it tracks using Git is what’s available where, updating that data on an eventually consistent basis whenever two storage sites come into contact. It also borrows Git functionality for tracking moves, renames, etc. The object-storage parts, on the other hand, are essentially a separate content-addressable store from the normal one Git uses for its objects.
(The concrete form of a git-annex worktree is a Git-tracked tree of symlinks pointing to .git/annex/objects under the repo root, where the actual data is stored as read-only files, plus location-tracking data indexed by object hash in a separate branch called “git-annex”, which the git-annex commands manipulate using special merge strategies.)
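To make that layout concrete, here's a minimal Python sketch of the symlink-plus-object-store idea (illustrative only: the real key format and the nested directory hashing git-annex uses are more involved than this flat version):

```python
import hashlib
import os

def annex_add(repo, path):
    """Move a file into a content-addressable store under .git/annex/objects
    and leave a symlink behind at its original path. Rough sketch of
    git-annex's worktree layout, not its actual on-disk scheme."""
    with open(path, "rb") as f:
        data = f.read()
    # git-annex keys look roughly like "SHA256-s<size>--<hexdigest>"
    key = "SHA256-s%d--%s" % (len(data), hashlib.sha256(data).hexdigest())
    obj_dir = os.path.join(repo, ".git", "annex", "objects")
    os.makedirs(obj_dir, exist_ok=True)
    obj = os.path.join(obj_dir, key)
    os.replace(path, obj)   # content moves into the object store...
    os.chmod(obj, 0o444)    # ...and becomes read-only
    # ...and a relative symlink takes the file's place in the worktree
    os.symlink(os.path.relpath(obj, os.path.dirname(path)), path)
    return key
```

The symlink, not the content, is what gets committed to the normal Git history; in the real tool the location-tracking data lives on that separate "git-annex" branch.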
However, I mostly use annex as a way to archive stuff and make sure I have enough copies in distinct physical locations. So for photos I now just tar them up, with one .tar file per family member per year. This works fine for me for any data I want to keep safe but don't need to access directly very often.
It's quite safe to just statically link most, if not all, of them directly into the application, even when some of them are shared by other applications. I have seen this complaint repeated a few times. The reply from the Haskellers seems to be that this is for the fine-grained modularity of the library ecosystem. But why do they treat it like everything starts and ends with Haskell? Sometimes there are other priorities, like system administration. None of the other compiled languages have this problem: Rust, Go, Zig, ... Even plain old C and C++ aren't this frustrating with dependencies.
I need to clarify that I'm not hostile towards the Haskell language, its ecosystem and its users. It's something I plan to learn myself. But why does this problem exist? And is there a solution?
I think it's more the distro maintainers' choice. For Solus, the amount of dependencies was just too much for me to handle, so we resorted to static linking.
There ARE statically linked Haskell packages in the AUR so it's at least feasible. I haven't even dug into the conversations around why packagers are insisting on dynamic linking of distro packages - I just avoid them for the same reasons you mention.
I can't really speak confidently to why it is exactly - I can only guess. Clearly dynamic linking makes sense in a lot of cases for internal application distribution - which is where Haskell is often used - so maybe people are incorrectly projecting that onto distro packages?
Digging a little bit, I found that git-annex is written in Haskell (not a fan) and seems to be 50% slower (expected from Haskell, but also only one source so far, so not really reliable).
I don't see the appeal of the complexity of the commands; they probably serve a purpose. Once you've opened a .gitattributes from Git LFS, you pretty much know all you need, and you barely need any commands anymore.
Also, I like how setting up a .gitattributes makes everything transparent, the same way .gitignore works. I don't see any equivalent with git-annex.
Lastly any "tutorial" or guide about git-annex that won't show me an equivalent of 'git lfs ls-files' will definitely not appeal to me. I'm a big user of 'git status' and 'git lfs ls-files' to check/re-check everything.
E.g. if you drop something, it will by default check the remotes it has access to for that content in real time; it can be many orders of magnitude faster to use --fast etc. to (somewhat unsafely) skip all that and trust whatever metadata you have a local copy of.
I'm not sure what you are doing, but from looking at the git-lfs-ls-files manpage `git annex list --in here` is likely what you want?
You can apparently do, sort of, but not really, the same thing git-fetch-file[1] does, with git-annex:
git fetch-file add https://github.com/icculus/physfs.git "**" lib/physfs-main
git fetch-file pull
`add` creates this at `.git-remote-files`:

    [file "**"]
        commit = 9d18d36b5a5207b72f473f05e1b2834e347d8144
        target = lib/physfs-main
        repository = https://github.com/icculus/physfs
        branch = main
But git-annex's documentation goes on and on about a bunch of commands I don't really want to read about, whereas those two lines and that .git-remote-files manifest just told you what git-fetch-file does.

My use case is mostly ETL related, where I want to pull all customer data (enterprise customers) so I can process it. But also keep the data updated, hence pull?
Unfortunately this is not (yet?) supported I think. But you could also just do something like this: `rclone copy/sync <sharepoint/dropbox-remote>: ./<local-directory> && git annex add ./<local-directory> && git commit -m <message>`.
I tried using it for syncing large files in a collaborative repository, and the use of "magic" branches didn't seem to scale well.
Back (long time ago) when I was looking into this, there was no KISS, out-of-the-box way to manage the Git Annex operations a Git user would be allowed to perform. Gitolite (or whatever Git platform of choice) can address access control concerns for regular Git pushes, but there is no way to define policies on Git Annex operations (configuration, storage management).
Might not be super hard to create a Gitolite plugin to address these, but ultimately for my use-case it wasn’t worth the effort (I didn’t really need shared Git Annex repos). Do you tackle these concerns somehow? I guess if people don’t interact with your repositories via Git/SSH but only through some custom UI, you might deal with it there.
Does anyone know if the situation has improved on that front in the past 5 years?
FD: I have contributed to git-lfs.
Here is a talk by a person who adores it: Yann Büchau: Staying in Control of your Scientific Data with Git Annex https://www.youtube.com/watch?v=IdRUsn-zB2s
And an interview: "When power is low, I often hack in the evenings by lantern light." https://usesthis.com/interviews/joey.hess/
I.e. annex is really in a different problem space than "big files in git", despite the obvious overlap.
A good way to think about it is that git-annex is sort of a git-native and distributed solution to the storage problem at the "other side" ("server side") of something like LFS, and to reason about it from there.
I dislike git-annex that much:
- It converts your files into blobs and bloats your file system
- As others have alluded to previously, my primary use case is to ensure sync between distributed files, not to version them (why would anyone possibly need that??)
- You can use AI to build a Python-based solution that will hash your files and put them into a lookup table, then create some helper methods to sync sources using rclone
Far simpler and more efficient methods exist.
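For what it's worth, the "hash your files into a lookup table" idea from the parent is easy to sketch. Here's a rough Python version, assuming you then hand the diff to rclone or similar (the function names are made up for illustration):

```python
import hashlib
import os

def build_index(root):
    """Walk a tree and map each file's relative path to its SHA-256 digest."""
    index = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # hash in 1 MiB chunks so large files don't blow up memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            index[rel] = h.hexdigest()
    return index

def diff_indexes(local, remote):
    """Paths to copy: missing on the remote, or present with a different hash."""
    return sorted(p for p, h in local.items() if remote.get(p) != h)
```

You'd persist the index as JSON and feed the diff to something like `rclone copy --files-from`, though at that point you've reimplemented a fair chunk of what git-annex already does (minus the location tracking and redundancy checks).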