I don't understand the focus on commit messages. I never read the git log. You can't assume anyone reading the code has access to the commit history or the time to read it. The codebase itself should contain any important documentation.
Well, no, because 1. it's not useful, because 2. most people never write anything useful there (which is a two-part vicious cycle), and 3. editors don't usefully surface it.
If we fix #3, and individual projects then fix #2 for themselves with contribution policies that enforce writing good commit messages, then #1 will no longer be true.
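As a rough illustration of what such a contribution policy could check mechanically, here is a minimal sketch. The specific rules (subject length, blank second line, a body of at least a few words, a ticket reference) and the names `check_commit_message`, `MAX_SUBJECT_LEN`, `MIN_BODY_WORDS` and `TICKET_RE` are assumptions made up for this example, not any project's actual policy.

```python
import re

# Hypothetical policy values; tune or replace them per project.
MAX_SUBJECT_LEN = 72
MIN_BODY_WORDS = 5
# Matches references such as "PROJ-123" or "#123" (an assumed ticket format).
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b|#\d+")


def check_commit_message(message: str) -> list:
    """Return a list of policy violations; an empty list means the message passes."""
    lines = message.strip().split("\n")
    if not lines[0].strip():
        return ["message is empty"]

    problems = []
    subject = lines[0].strip()
    if len(subject) > MAX_SUBJECT_LEN:
        problems.append(f"subject longer than {MAX_SUBJECT_LEN} characters")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line must be blank")

    # Everything after the blank separator line counts as the body.
    body = " ".join(line.strip() for line in lines[2:])
    if len(body.split()) < MIN_BODY_WORDS:
        problems.append("body should explain why the change was made")
    if not TICKET_RE.search(message):
        problems.append("no ticket reference found")
    return problems
```

A check like this could run from git's `commit-msg` hook, which receives the path of the file holding the proposed message. It can only enforce form, of course; whether the explanation is any good still needs a human reviewer.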
> You can't assume anyone reading the code has access to the commit history or the time to read it.
You can if you're the project's core maintainer/architect/whoever decides how people contribute to your software (in a private IT company, the CTO, I suppose). You get to decide how to onboard people onto your project. And you get to decide what balance there will be between the amount of time new devs waste learning an impenetrable codebase, vs. the amount of time existing devs "waste" making the codebase more lucid by explaining their changes.
> The codebase itself should contain any important documentation.
My entire point is that commit messages are part of "the codebase" — that "the codebase" is the SCM repo, not some particular denuded snapshot of an SCM checkout. And that both humans and software should take more advantage of — even rely upon! — that fact.
I've been in enough projects that changed version control systems, or had to restart version control from a snapshot for whatever reason (data loss, performance issues with tools due to the commit history, etc.), that I wouldn't want to take this approach.
> amount of time new devs waste learning an impenetrable codebase, vs. the amount of time existing devs "waste" making the codebase more lucid by explaining their changes.
That's a false dichotomy. The codebase won't be impenetrable if there are appropriate comments in it. In my experience, time is better spent making the codebase more lucid in the source code than in an external commit history. The commit messages should be good too, but I only rely on them when something is impossible to understand without digging through the history and finding the associated ticket or motivation. That is a bad state to be in, so at that point a comment gets added. Of course good commit messages are fine too; none of this precludes them.
Agreed on most points, but a good SCM also provides the ability to bisect bugs and to show context that is hard to capture in explicit comments. For example: what changed at the same time as some other piece of code, or what the same person changed a few days before and after some line was introduced.
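To make the bisect point concrete, here is a minimal sketch of the idea behind bisecting: a binary search over an ordered list of commits. It is not how git implements `git bisect` (real histories are graphs, not lists), and `first_bad_commit` and `is_bad` are hypothetical names for this example. It assumes a linear history in which the first commit is good, the last is bad, and every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an oldest-to-newest commit list for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    badness never disappears again once it has been introduced.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the bug was introduced at mid or earlier
        else:
            good = mid  # the bug was introduced after mid
    return commits[bad]
```

The search needs only about log2(n) test runs, but it depends on every intermediate commit being testable, which is one reason a history full of half-working commits makes bisecting less useful.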
Regarding your first point:
> I've been in enough projects that changed version control systems
I have the impression that with the introduction of git, it suddenly became en vogue to have tools to migrate history from one SCM to another. Therefore, I wouldn't settle for restarting from a snapshot anymore.
With git you can cut off history that is too old and weighs down the tools. For example, you can simplify older history while keeping newer history as it is. That is of course not easy, but it can be done with git and some scripting.
> I've been in enough projects that changed version control systems, or had to restart version control from a snapshot for whatever reason (data loss, performance issues with tools due to the commit history, etc.), that I wouldn't want to take this approach.
When was that? I've never seen that in 15-20 years of software development; I've seen plenty of projects change VCS but they always had the option of preserving the history.
Sure it's there, but having to wade through a large history of experiments tried and failed when trying to answer why this thing is here right now just feels subpar. Sometimes it's definitely helpful to read the commits which introduced a behavior, but that feels like a fallback for when reading the code as it exists now isn't enough. It works, but it is much slower.
> Sure it's there, but having to wade through a large history of experiments tried and failed when trying to answer why this thing is here right now just feels subpar
That is one of the downsides of trunk-based development. If one keeps a history of all failed experiments, the usefulness of the commit history deteriorates, both for reading commit messages and for bisecting bugs.
> In other words: don't think of it as "meticulously grooming a commit history"; instead, think of it as your actual job not being "developer", but rather, as you (and all your coworkers) being the writers and editors of a programming textbook about the process of building program X... which happens to compile to program X.
If you have to "wade through experiments" to read the commit history, that means that the commit history hasn't had a structural editing pass applied to it.
Again: think of your job as writing and editing a textbook on the process of writing your program. As such, the commit history is an entirely mutable object — and, in fact, the product.
Your job as editor of the commit-history is, like the job of an editor of a book, to rearrange the work (through rebasing) into a "narrative" that presents each new feature or aspect of the codebase as a single, well-documented, cohesive commit or sequence of commits.
(If you've ever read a programming book that presents itself as a Socratic dialogue — e.g. Uncle Bob's The Clean Coder — each feature should get its own chapter, and each commit its own discussion and reflected code change.)
Experiments? If they don't contribute to the "narrative" of the evolution of the codebase — helping you to understand what comes later — then get rid of them. If they do contribute, then keep them: you'll want to have read about them.
Features gradually introduced over hundreds of commits? Move the commits together so that the feature happens all at once; squash commits that can't be understood without one another into single commits; break commits that can be understood as separate "steps" into separate commits.
After factoring hunks that should have been independent out into their own commits, squashing commits with their revert-commits, etc., your commit history, concatenated into a file, should literally be a readable literate-programming metaprogram that you read as a textbook, that when executed, generates your codebase. While also still serving as a commit history!
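As a small illustration of one such editing step, here is a sketch that plans which commits survive once each commit is cancelled against its revert. It only works on subject lines and relies on the default subject that `git revert` writes, `Revert "<original subject>"`; the function name `drop_reverted_pairs` is hypothetical, and the sketch does not touch a repository or rewrite any history. Actually dropping the pair during a rebase can still conflict if later commits touched the same code.

```python
import re

# Default subject format produced by `git revert`.
REVERT_RE = re.compile(r'^Revert "(.*)"$')


def drop_reverted_pairs(subjects):
    """Return subjects (oldest first) with each commit/revert pair removed."""
    kept = []
    for subject in subjects:
        match = REVERT_RE.match(subject)
        if not match:
            kept.append(subject)
            continue
        original = match.group(1)
        # Cancel the most recent surviving commit with the reverted subject.
        for i in range(len(kept) - 1, -1, -1):
            if kept[i] == original:
                del kept[i]
                break
        else:
            # The reverted commit lies outside this range, so keep the revert.
            kept.append(subject)
    return kept
```

The output is just a to-do list for the human editor doing the rebase, which is the point: the tooling for this kind of grooming already exists, and small scripts can take some of the tedium out of it.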
(Keeping in mind that you still have all the other things living immutably in your SCM — dead experiments in feature branches; a develop branch that immutably reflects the order things were merged in; etc. It's only the main branch that is groomed in this fashion. But this groomed main branch is also the base for new development branches. Which works because nobody is `git merge`ing to main. Like LKML, the output-artifact of a development branch should be a hand-groomed patchset.)
And, like I said, this is all strictly inferior to an approach that actually involves literate programming of a metaprogram of codebase migrations — because, by using git commit-history in this way, you're gaining a narrative view of your codebase, but you're losing the ability to use git commits to track the "process of editing the history of the process of developing the program." Whereas, if you are actually committing the narrative as the content of the commits, then the "process of editing the history" is tracked in the regular git commits of the repo — which themselves require no grooming for presentation.
But "literate programming of a metaprogram that generates the final codebase" can only work if you have editor support for live-generating+viewing the final codebase side-by-side with your edits to the metaprogram. Otherwise it's an impenetrably-thick layer of indirection — the same reason Aspect-Oriented Programming never took off as a paradigm. Whereas "grooming your commit history into a textbook" doesn't require any tooling that doesn't already exist, and can be done today, by any project willing to adopt contribution policies to make it tenable.
---
Or, to put this all another way:
Imagine there is an existing codebase in an SCM, and you're a technical writer trying to tell the story of the development of that codebase in textbook form.
Being technically-minded, you'd create a new git repo for the source code of your textbook — and then begin wading through the messy, un-groomed commit history of the original codebase, to refactor that "narrative" into one that can be clearly presented in textbook form. Your work on each chapter would become commits into your book's repo. When you find a new part of the story you want to tell, across several chapters, you'd make a feature branch in your book's repo to experiment with modifying the chapters to weave in mentions of this side-story. Etc.
Presuming you finish writing this textbook, and publish it, anyone being onboarded to the codebase itself would then be well-advised to first read your textbook, rather than trying to first read the codebase itself. (They wouldn't need to ever see the git history of your textbook, though; that's inside-baseball to them, only relevant to you and any co-editors.)
Now imagine that "writing the textbook that should be read to understand the code in place of reading the code itself" is part of the job of developing the program; that the same SCM repo is used to store both the codebase and this textbook; and that, in fact, the same textual content has to be developed by the same people under the constraints of solving both problems in a DRY fashion. How would you do it?
Imho, you are missing out on a great source of insight. When I want to understand some piece of code, I usually start navigating the git log from git blame.
Even just one-line commit messages that refer to a ticket can help understanding tremendously.
Even the output of git blame itself is helpful. You can see which lines changed together and in which order. You can also see which colleague to ask questions.
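For what it's worth, that "which lines changed together" view is easy to extract yourself. Below is a minimal sketch that groups the lines of a file by the commit that last touched them, given the text produced by `git blame --line-porcelain <file>`. The function name `lines_by_commit` is made up for this example, and the parser only looks at the `author` and `summary` fields, ignoring everything else in the output.

```python
import re
from collections import defaultdict

# Header line: "<sha> <original line> <final line> [<lines in group>]".
HEADER_RE = re.compile(r"^([0-9a-f]{40,64}) \d+ (\d+)")


def lines_by_commit(porcelain: str) -> dict:
    """Map (sha, author, summary) to the final line numbers last touched by that commit."""
    groups = defaultdict(list)
    sha = author = summary = None
    lineno = 0
    for line in porcelain.split("\n"):
        if line.startswith("\t"):
            # A TAB-prefixed line is the file content and closes one entry.
            groups[(sha, author, summary)].append(lineno)
            continue
        header = HEADER_RE.match(line)
        if header:
            sha, lineno = header.group(1), int(header.group(2))
        elif line.startswith("author "):
            author = line[len("author "):]
        elif line.startswith("summary "):
            summary = line[len("summary "):]
    return dict(groups)
```

Each key then tells you at a glance which commit, which colleague, and (if the summary mentions one) which ticket to start from.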