PyPy has moved to Git, GitHub (pypy.org)
169 points by lumpa on Jan 1, 2024 | hide | past | favorite | 76 comments


> Open Source has become synonymous with GitHub, and we are too small to change that.

It's kind of sad that this is true.

I'm guilty myself, I contribute to projects on GitHub more often than on any other platform.

And when I search for open source projects the first page I use is GitHub.


A lot of this is SourceForge's fault.

They had a sizable lead and completely bungled it.


It's not just that they dropped the ball, they actively sabotaged whatever goodwill they had built by adding malware to software. Not only was this a massive hassle, it ruined the reputation of lots of FOSS projects with folks who just wanted to use some of the most popular consumer-ish open source software like Filezilla.

While SF was crapping where they eat, GitHub built a lot of trust and goodwill with a lot of people.


It should be noted that the malware bundling was done when SourceForge was owned by DHI Group, Inc. SourceForge has since changed owners (BIZX/Slashdot), and has been under them for many years now. They undid the bundling and are trying to run the site the way it was run before. It seems to be going well.

I would consider SF a viable Github alternative, but the bad reputation caused by a temporary owner just seems to stick forever.


The SourceForge UI and overall experience is still a decade or more behind the experience on GitHub and most of the other modern, maintained equivalent sites/services like Fossil, GitLab, Gitea, etc.

That ancient-feeling UI doesn’t win them a lot of forgiveness. If the only notable change has been “we took away the malware” and the site otherwise remains stagnant, SourceForge will continue to feel very inferior to more modern alternatives.


Just goes to show how much damage you can do to a brand by betraying your users. I still won't use SourceForge, fuck them.


SF's development also essentially stalled during that period, and the current owner has to do much more than (proper) management to cope with competitors. If you need proof, compare SF with Gitea/Forgejo.


Open Source = a Microsoft property... I don't want it to be true, and I will act like it isn't true for things I have influence over - i.e. GitHub is not the first choice for hosting a repository.

Some people don't even know the difference between Git and GitHub...


> Some people don't even know the difference between Git and GitHub...

I have also found this to be the case, even with engineers that have years of experience. It's both impressive and awful.


Be part of the solution, not part of the problem. You can use some other forge and keep an up-to-date github repository as a read-only front.
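A minimal version of that setup, with local paths standing in for the real forge and GitHub URLs (all names here are placeholders):

```shell
set -e
# Stand-ins for real remotes: "forge.git" plays the primary (e.g. a
# self-hosted Gitea/Forgejo) and "github.git" the read-only GitHub mirror.
tmp=$(mktemp -d); cd "$tmp"
git init -q --bare forge.git
git init -q --bare github.git

# Some history on the primary forge.
git clone -q forge.git work
cd work
git config user.email dev@example.com; git config user.name dev
echo hello > README; git add README; git commit -qm 'initial commit'
git push -q origin HEAD:main
cd ..

# The mirror job: a --mirror clone of the primary, pushed verbatim
# (all branches, tags, and other refs) to the GitHub copy.
git clone -q --mirror forge.git mirror.git
git -C mirror.git remote add github "$tmp/github.git"
git -C mirror.git push -q --mirror github
```

In a recurring job (cron/CI) you'd keep `mirror.git` around, `git fetch --prune origin` in it, and push again, rather than re-cloning each time. Since nobody pushes to the GitHub copy directly, it stays a read-only front.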


I don't host my own repositories on GitHub, I host a gitea instance myself.


What exactly is the problem?


Shouting under the rock, "Microsoft!"


I found Codeberg to be a good substitute.


I will admit, I've often searched for $project_name github to get to their repositories. It shouldn't matter but it's just a force of habit now.

That said I do feel some joy when I see a project on Gitlab and am happy to contribute there, eg FDroid.


Is it? I don't know. I don't use it for personal (actually personal) stuff, but if I actually want to publish something, I'd do it there. If I want to contribute to something, it's way easier to do it on Github than most other places. It makes searching for code easier too. If Microsoft decides to abuse the monopoly, or if Gitlab etc. actually became much better, I don't imagine it would be very difficult to switch. Well, yeah, sure, Github Actions and issue history are somewhat of a vendor lock-in, but it's not that bad, I suppose.

Maybe Copilot (made possible by the huge non-commercial codebase on Github) being somewhat unfair advantage to other commercial alternatives is a bit troublesome, yeah. But otherwise I just don't see why Github being a de-facto standard is bad. In fact, I am somewhat annoyed when a really popular project doesn't have a Github repository (mostly because it makes filing an issue, or even reading existing issues much more difficult in most cases). So I'm actually glad to hear that some big projects feel pressed to migrate to github. What's even a problem with that, apart, maybe, for github actions, that honestly suck?

(Maybe I should add: I am a git hater, and do think that mercurial is just unquestionably better, but that battle was lost a long time ago, so I don't suppose it's the topic of this discussion.)


I love having this central location even though git is distributed, mainly because having to go to multiple Git hosts of varying quality would be a pain in the ass.


If you are confused (like me) and thought this was about PyPI (the Python package repository): no. It is about a project called PyPy (one can argue it is a bad name), an implementation of the Python interpreter that doesn't build on CPython; instead it relies on a JIT compiler. It is syntax-compatible, but if your code uses any library or method relying on C extensions then you may be out of luck (goodbye NumPy... etc.).

Edit: They have C-layer emulation, so you can use those libraries, but I don't know its limitations or current status [1][2]

[1] https://www.pypy.org/posts/2018/09/inside-cpyext-why-emulati...

[2] https://pythoncapi.readthedocs.io/cpyext.html


They do support numpy. The pypy name predates pypi. You’re off on multiple details here.


PyPy being a Python JIT written in Python with an ouroboros as a logo is pretty much the perfect name.


to be fair, PyPy predates PyPI


Now all we need is a PiPy and we'll have all the pies


PiPi would complete the set but that would be (doubly) irrational…


I think you mean transcendental.


That works too. Perhaps that would be how the pie would taste: truly transcendental, but I don't think it could ever be finished.


How so? PyPI launched in 2003, PyPy's first release was in 2007. https://www.pypa.io/en/latest/history/#before-2013


PyPy was started early in 2003 too, the first release took a while. PyPI was branded as 'The Cheeseshop' in the early years.


Nit. I believe you can use numpy and at least some other libraries relying on native extensions but performance might vary: https://doc.pypy.org/en/latest/faq.html#should-i-install-num...


> one can argue it is bad name

Given the way that pypy is implemented, I think the name is quite clever really.


The packages repo is known as PyPI (like Py P.I.), not PyPi.


As a counterbalance, I'm very familiar with PyPy and have never heard "PyPI".

> one can argue it is bad name

I suppose one can, but it's a python interpreter written in python, so I think it's pretty good.


THANK YOU. This was my first reaction as well.


I've been using git happily for many years. Strangely enough, the provenance of a commit (i.e. which branch a commit originally came from) has not really mattered to me very much. Mercurial provides this, and they are using `git notes` to add this provenance metadata to each commit during the migration to git.

I would have thought I'd need this much more, but I have not. In plain git I'll just `git log` and grep for the commit in case I want to make sure a commit is available in a certain branch.
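For anyone who hasn't used it, `git notes` attaches metadata to an existing commit without rewriting it, which is what makes it suitable for a migration like this. A rough sketch (the note text here is made up, not PyPy's actual conversion format):

```shell
set -e
git init -q demo && cd demo
git config user.email dev@example.com; git config user.name dev
echo x > file; git add file; git commit -qm 'some work'

# Record which (Mercurial) branch this commit came from, as a note.
# Notes live under refs/notes/ and don't change the commit hash.
git notes add -m 'branch: improve-jit-logging'

# The note can be read back directly, and shows up in `git log`.
git notes show HEAD
git log -1 --notes
```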


The point is giving branches a meaning (e.g. "implementation of this feature") and being able to at least keep the information that a given commit was part of that. (Well, at least that's why I'd want Mercurial's named branches; I'm not sure that's how this project used them.)


Wouldn't good merge commit conventions work to preserve as much of this sort of information as desired? All the commits of the branch are contained in it, with the merge commit message preserving that info.


When I want to see inside a piece of software I look for (1) the source code; (2) the git-blame; (3) the code review for significant commits. I have never wanted to see into the history before that point, namely how the developer drafted and polished their idea prior to the final code review approval.

What practical use case am I missing out on when these work-in-progress draft commits are lost? I can’t see one.


But 33% of PyPy packages contain the potential for extreme security flaws and you don't know which ones until it gets you. How bad do you have to want to use Python to tolerate that?

"“When we actually examined the behavior and looked for new attack vectors, we discovered that if you download a malicious package — just download it — it will automatically run on your computer,” he told SC Media in an interview from Israel. “So we tried to understand why, because for us the word download doesn’t necessarily mean that the code will automatically run.”

But for PyPi, it does. The commands required for both processes run a script, called pip, that executes another file called setup.py, which is designed to provide a data structure for the package manager to understand how to handle the package. That script and process are also composed of Python code that runs automatically, meaning an attacker can insert and execute that malicious code on the device of anyone who downloads it." https://www.scmagazine.com/analysis/a-third-of-pypi-software...


PyPy is a Python implementation; it's not the same as PyPI (the Python Package Index).


my bad .. handicapped by the way I auditize .. point remains the same tho. Need to clean up PyPI or stop the mortals from using PyTHON. In the meantime, maybe put your venv's into a single non-emulated vm.


Why do people like Mercurial branches? Was it revamped? I hated it when I used it.

By all means, I prefer Git branches.


There are benefits to having branches be an inherent property of a commit as opposed to the Git model of a dynamic property of the graph.

Suppose I have a branch A with three commits, and then I make another branch B on top of that with another few commits. The Git model essentially says that B consists of all commits that are an ancestor of B that aren't the ancestor of any other branch. But now I go and rebase A somewhere else--and as a result, B suddenly grew several extra commits on its branch because those commits are no longer on branch A. If I want to rebase B on the new A, well, those duplicated commits will cause me some amount of pain, pain that would go away if only git could remember that some of those commits are really just the old version of A.
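This is straightforward to reproduce in a throwaway repo (branch names are illustrative):

```shell
set -e
git init -q demo && cd demo
git config user.email dev@example.com; git config user.name dev
git commit -q --allow-empty -m base
git branch -m main

git checkout -qb A
echo 1 > a1; git add a1; git commit -qm a1
echo 2 > a2; git add a2; git commit -qm a2

git checkout -qb B            # B builds on top of A
echo 3 > b1; git add b1; git commit -qm b1

# A gets rebased onto a new base...
git checkout -qb newbase main
echo n > n; git add n; git commit -qm newbase-work
git rebase -q --onto newbase main A

# ...and "B minus A" suddenly contains three commits: b1 plus the
# old a1 and a2, which are no longer reachable from (the new) A.
git log --oneline A..B
```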


> If I want to rebase B on the new A, well, those duplicated commits will cause me some amount of pain

Not really. Git will recognize commits that produce an identical diff and skip them. Your only pain will be that for each skipped commit, you will see a notification line in the output of your `git rebase`:

    warning: skipped previously applied commit <hash>


If the rebase of A was nasty, git isn't smart enough to figure out that the new A commits are similar enough to the old ones, so it won't know to skip the old-A commits.


> There are benefits to having branches be an inherent property of a commit

And drawbacks, naturally. Advanced branching/merging workflows become extremely painful if not impossible, which makes mercurial unusable as a "true" DVCS (where everyone maintains a fork of the code and people trade PRs/merges).


> Advanced branching/merging workflows become extremely painful if not impossible

That's really, really not true. First off, I used the word "inherent", which doesn't mean "immutable"; you can retain all the benefits of mutability if you so desire. Of course, Mercurial historically focused a lot heavier on immutable commits than Git did, but hg eventually found a different path that really makes using git feel antediluvian in comparison.

The second thing to note is that there's no requirement that the 'branch' property of a commit correspond to only one head. Actually, I don't think any of the mercurial repositories I've contributed to ever bothered with branches; there's just simply no need in mercurial to create multiple named branches, the way there is in git.

Finally, mercurial solves the workflow problem in another way, by essentially realizing that there is a dichotomy between public, immutable commits and work-in-progress draft commits. The problem with PRs is that you end up in a situation where you have the unenviable choice between making updates with 'address fixes' commits that pollute history or rebases that risk making comments go into the ether (especially on GitHub). You might have extra squashes or rebases that make PRs that depend on other PRs painful. Mercurial instead makes a rebase or other history edit simply mark the old commit as dead and link to the new version, so that any other commits that depend on it can know how to be updated to the new version. And this information is spread to anyone who pulls from your repo, but need not be retained when pushed to anyone who didn't know about the old dead versions!


Like what?

How does it differ from an extra line on each commit message saying the branch name, and some options to parse it if desired?

I definitely get annoyed sometimes when I have to put in extra effort to figure out which side of the tree is which.


PyPy described the following in the FAQ:

> The difference between git branches and named branches is not that important in a repo with 10 branches (no matter how big). But in the case of PyPy, we have at the moment 1840 branches. Most are closed by now, of course. But we would really like to retain (both now and in the future) the ability to look at a commit from the past, and know in which branch it was made. Please make sure you understand the difference between the Git and the Mercurial branches to realize that this is not always possible with Git— we looked hard, and there is no built-in way to get this workflow.

> Still not convinced? Consider this git repo with three commits: commit #2 with parent #1 and head of git branch “A”; commit #3 with also parent #1 but head of git branch “B”. When commit #1 was made, was it in the branch “A” or “B”? (It could also be yet another branch whose head was also moved forward, or even completely deleted.)

In this post they say that "git notes solves much of point (1): the difficulty of discovering provenance of commits, although not entirely"
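You can reproduce the FAQ's example directly; `git branch --contains` is about the closest built-in query, and it shows that the information was never recorded:

```shell
set -e
git init -q demo && cd demo
git config user.email dev@example.com; git config user.name dev

git commit -q --allow-empty -m 'commit 1'
c1=$(git rev-parse HEAD)
git branch -m A                                # head of branch A...
git commit -q --allow-empty -m 'commit 2'      # ...is commit 2, parent commit 1

git checkout -q -b B "$c1"                     # branch B also starts at commit 1
git commit -q --allow-empty -m 'commit 3'      # commit 3, parent commit 1

# Which branch was commit 1 made on? Git has no record; all it can
# say is which branches contain it *now*:
git branch --contains "$c1"                    # lists both A and B
```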


The question in your example seems odd to me. It can be interpreted as either 2 OR 3 unique branches depending on how you read it.

There is either a base branch A whose current head is commit #2, plus branch B with head commit #3.

OR

Commit #1 is on branch “default”, commit #2 is on branch “A” with commit #1 as its parent, and commit #3 is on branch “B” with commit #1 also as its parent.

Consider your same example with forking instead of branching, how would the issue be resolved?


> It can be interpreted as either 2 OR 3 unique branches depending on how you read it.

The question isn't how many branches, it's what branch the commit was on at the point in time it was created. That's not up to interpretation. It's information that was not recorded.

> Consider your same example with forking instead of branching, how would the issue be resolved?

Forked repositories don't have IDs, don't generally keep track of each other, and there's no way to even count them. So that's not solvable.

But branches do have names, and you almost always make commits onto branches. We shouldn't give up on tracking branches just because tracking forks is hard.


I mean you can't really compare them since git doesn't even _have_ branches as Mercurial understands them. git's branches would perhaps better be called twigs in comparison. git's lightweight branches better map to Mercurial's topics or bookmarks, though neither perfectly. And Mercurial has even lighter weight branches since you can just make a new head by committing without having to name anything, and it won't yell at you about a detached head like git will.


Speaking of git, for mega-monorepo performance, we're gonna need synthetic FSes and SCM-integrated synthetic checkouts. Sapling (it was hg in the past but was forked and reworked extensively) will be able to do this if EdenFS is ever released, but Git will need something similar. This will require a system agent running a caching overlay FS that can grab and cache bits on the fly. Yes, it's slightly slower than having the contents already, but there is no way to check out a 600+ GiB repo on a laptop with a 512 GiB SSD.


That already exists. It’s called Scalar[1]. It has been built into Git since October 2022[2], dates back to 2020[3], and is the spiritual successor of something Microsoft was using as far back as 2017[4].

1. https://git-scm.com/docs/scalar

2. https://github.blog/2022-10-13-the-story-of-scalar/

3. https://devblogs.microsoft.com/devops/introducing-scalar/

4. https://devblogs.microsoft.com/bharry/the-largest-git-repo-o...


Scalar explicitly does not implement the virtualized filesystem the OP is referring to. The original Git VFS for Windows that Microsoft designed did in fact do this, but as your third link notes, Microsoft abandoned that in favor of Scalar's totally different design which explicitly was about scaling repositories without filesystem virtualization.

There's a bunch of related features they added to Git to achieve scalability without virtualization, including the Scalar daemon which does background monitoring and optimization. Those are all useful and Scalar is a welcome addition. But the need for a virtual filesystem layer for large-scale repositories is still a very real one. There are also some limitations with Git's existing solutions that aren't ideal; for example Git's partial clones are great but IIRC can only be used as a "cone" applied to the original filesystem hierarchy. More generalized designs would allow mapping arbitrary paths in the original repository to any other path in the virtual checkout, and synchronizing between them. Tools like Josh can do this today with existing Git repositories[1].

The Windows repository that was referenced isn't even that big at 300GB, either. That's well within the realm of single-machine stuff. Game studios regularly have repositories that exist at multi-terabyte size, and they have also converged on similar virtualization solutions. For example, Destiny 2 uses a "virtual file synchronization" layer called VirtualSync[2] that reduced the working size of their checkouts by over 98%, multiple terabytes of savings per person. And in a twist of fate, VirtualSync was implemented thanks to a feature called "ProjFS" that Microsoft added to Windows... which was motivated originally by the Git VFS for Windows they abandoned!

[1] https://github.com/josh-project/josh

[2] https://www.gdcvault.com/play/1027699/Virtual-Sync-Terabytes...
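For reference, the stock (non-virtualized) flow being discussed, partial clone plus cone-mode sparse checkout, looks roughly like this; the repository layout is made up:

```shell
set -e
# A local stand-in for a big hosted repo. uploadpack.allowfilter lets
# partial-clone filters work over the local transport too.
git init -q big && cd big
git config user.email dev@example.com; git config user.name dev
mkdir -p src docs
echo 'int main(void){return 0;}' > src/main.c
echo 'manual' > docs/manual.txt
git add .; git commit -qm init
git config uploadpack.allowfilter true
cd ..

# Blobless partial clone: all history comes down, file contents are
# fetched on demand.
git clone -q --no-local --filter=blob:none big small
cd small

# Cone-mode sparse checkout restricted to src/; docs/ is removed from
# the working tree. Note the cone can only follow the repo's original
# directory hierarchy, which is the limitation mentioned above.
git sparse-checkout set --cone src
ls
```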


I worked on source control at Facebook/Meta for many years. On top of what aseipp said, I remember the early conversations we had with Microsoft where the performance targets for status/commit/rebase they wanted to hit were an order of magnitude behind where we wanted to be.

But most repositories are not that big so this is hardly an issue for most people. Personally, the system I'm most optimistic about in 2024 is Jujutsu. I've been using it full time with Git repos for several months and it's overall been a delight.


Every provider out there can talk the standard Git protocol, but every feature beyond that protocol becomes a proprietary API. I think if Git (or a project like it) made a standard protocol/data format for all the features of an SCM, then all those providers could adopt it, and we could start moving away from GitHub as the center of the known universe. If we don't make a universal standard (and implementation), then it'll remain the way it is today.


> the script properly retained the issue numbers

Oh that's quite helpful. I was worried about how lossy the migration would be.


They are right. Mercurial is better than git for 99% of usecases. But we lost this one.


Git should be a pretty easy-to-federate system, at least in terms of mimicking pull requests. Is there anything that tries to do so? Gitea?


iirc GitLab is trying to federate using ActivityPub

Edit: yes, https://docs.gitlab.com/ee/architecture/blueprints/activity_...


Codeberg is also working on federating, or maybe they already do. My experience using them was quite unpleasant, though, they're very feature-incomplete.


ForgeFed sounds promising: https://forgefed.org/


I used to use Mercurial as well and greatly preferred it, but for better or worse, Git won. I started using Git several years ago and haven't looked back.

No matter what people might say, I think this stuff matters for contributors and users who might be looking at your project, and git/github is the typical expectation. This is likely the right decision, as they are now ubiquitous.


Same story for us: started with Mercurial many years ago, but eventually the tooling around git and just "using the standard" were too big to ignore, and we migrated along with a bunch of other CI/CD and DevX improvements. Mercurial was cool, but the lack of support meant little things like Jenkins having to "pull" 3x instead of the 1x it does natively for git; adding up all these little things, just using git generally saved us a bunch of work.


If you liked the interface of Mercurial, there is https://foss.heptapod.net/mercurial/hg-git


I've used that in the past, but it doesn't really work that well on large projects--it basically works by keeping a hg and a git version of the same repository, and storing a mapping between the two, which scales really poorly with multi-million commit repositories.

What I really want is something that will let me use the interface of hg's power tools (revsets, phases, changeset evolution) on an existing git repository.


You will probably like Jujutsu, which takes much inspiration from Mercurial, and even has a few prior Mercurial hackers working on it. It uses the Git data model underneath (so feel free to use GitHub), but has an entirely rebuilt UX and set of algorithms: https://github.com/martinvonz/jj

It isn't a 1-to-1 hg clone, either. But tools like revsets are there, along with "anonymous branching", log templates, changeset evolution "built in" to the design, etc. There is no concept of phases (we might think about adding that), but there is a concept of immutable commits, so you don't overwrite public ones. The default output is designed to be succinct and beautiful, so it remains relevant on high-traffic repositories with lots of work-in-progress patches, and many developers.

It also has many novel features that make it stand out, like the working-copy-commit. We care a lot about performance and usability; to the extent performance is bad, some of it comes down to piggybacking on Git's data model and existing performance issues. Give it a shot. I think you might be pleasantly surprised.

Disclosure: I am a developer of Jujutsu. I do it in my spare time.

P.S: You might alternatively like Sapling, from Meta. It actually is a fork of Mercurial (you can see it in the UX and features) but is very different now; in particular it also uses the Git data model for the storage layer, so it works with GitHub. It will probably feel more familiar than Jujutsu at first. And it has some absolutely amazing features like `sl web` we can't match yet. https://sapling-scm.com/


That Torvalds’ second-biggest creation is now most closely associated with a Microsoft company gives me feelings.


Me too. It feels great to see such a clear example of building a successful business on an open source tool and ecosystem. Git is massively popular and actively used, GitHub built a huge community for discovering and interacting with open source projects, and since the acquisition, Microsoft-owned GitHub has continued improving their platform without breaking interop with the open spec.

Everybody wins.


But what about compatibility: is it fully compatible with Git? How will the contribution workflow change?


This is a tragic, wrongheaded move, and I say that as a big Git enthusiast (but a Github hater, to be fair...)

I don't think PyPy gains anything from this, not even a reduction in the annoying messages that have been psychologically torturing the maintainers. If anything, you're just opening yourself up to more common and frequent low-investment pestering.


Better late than never. Here's hoping that means things like `pip publish` are back on the table, too.


Are you confusing PyPI and PyPy? Easily done!


I sure am!


> foss.heptapod.net is not well indexed in google/bing/duckduckgo search, so people find it harder to search for issues in the project.

SEO : WWW structure :: gravity : orbital mechanics


What does this mean?


The web is shaped by the needs of the indexes as the solar system is shaped by gravity.



