Improving GitHub for science

timr · on May 14, 2014

So, aside from DOIs (which anyone can create: http://www.doi.org/faq.html), this post is mostly about free private git repos.

But the thing is, anyone can get as many free private git repos as they like at bitbucket.org right now. You only have to pay when your team is over 5 people, and even then it's pretty cheap. So if you want hosted git but cost is an object, just use Bitbucket. There's no reason to pay a premium for hosted git. It's a commodity.

cben · on May 15, 2014

Another tiny attraction of bitbucket for science is MathJax support in reStructuredText [0]. Unfortunately not implemented yet for markdown [1].

There is an unfilled niche for version control with out-of-the-box rendering of scientific formats: - mathjax everywhere - And of course full latex. This may be specialized enough to be better served by separate services, but now that sharelatex is open source a tighter integration is possible. - bibliographies (rendered as publication lists?) - Notebooks like IPython [2] and Sage; literate R. - plots of various kinds - tons of formats I never heard about (chemistry, bio, astronomy...)

Third-party web renderers exist for most formats (e.g. nbviewer.ipython.org) but there would be value for doing them all in one place - making repos pretty enough to serve as poor man's website (think how Github's up-front README rendering was good enough for many projects).

The obvious direction is improving[3]/forking Gitlab. I hoped Banyan.co would be that but just discovered they went down :-(. Authorea does some of this. And cloud.sagemath.com exposes an impressive amount of tools under one roof. Though it's focused on working more than browsing.

[0] https://bitbucket.org/cbensf/test-math For some reason not working at test-math/src view (reported https://bitbucket.org/site/master/issue/9483) [1] https://bitbucket.org/site/master/issue/7908/enable-mathjax-... [2] http://nbviewer.ipython.org/urls/bitbucket.org/mforbes/paper... [3] http://feedback.gitlab.com/forums/176466-general/suggestions...

BTW, Gitlab.com also has free private repos.

guynamedloren · on May 15, 2014

We're on the same wavelength here. This is exactly the problem I'm trying to tackle with Penflip (https://www.penflip.com/).

  - Began as a fork of Gitlab
  - Hosts public and private writing projects that are backed by git repos
  - Git repos allow for local editing, remote access, and easy collaboration
  - In-browser markdown editor with syntax highlighting
  - MathJax support (as of yesterday - still testing and tweaking)
  - Extended markdown support for tables, footnotes, etc 
  - One-click downloads in PDF / HTML / ePub / Word format

Originally, Penflip was geared towards writers, but I am seeing an increased demand for scientific and academic uses. I'm exploring this right now.

EDIT: wow, just realized you're the one behind mathdown.net, which I recently discovered while researching MathJax. Excellent work!

cannam · on May 15, 2014

Now that's an interesting site.

It would be cool if the free tier included one private project, so you could test it without making a public project full of typical test blathering. (I see the Discover page lists a few projects that are clearly only testing.)

The git import/export is very nice but looks a bit magicky -- am I right in thinking text files called anything other than document.txt will be silently ignored? I was looking at this with a view to working out how to import existing documents, ideally with history: what's the best workflow for that?

A serif font option in the editor/preview would be welcome.

Looking forward to trying this for something more substantial, though. Nice.

cben · on May 15, 2014

Small world, I just (re)discovered Penflip last week. Great news, now I have to take it for another spin :-)

mathjax in penflip thread: https://www.penflip.com/Penflip/help/discussions/6

ylem · on May 15, 2014

Penflip looks interesting! I'll have to play with the math jax options, but do you have any examples of it being used with figures and equations? Any plans for reference management?

isbadawi · on May 14, 2014

GitHub is not just hosted git. There are other features like the issue tracker, pull requests, editing files and committing from the browser, etc.

Also, if e.g. a research group already has a GitHub organization with their open source work (many do), then it's reasonable for them to want private repos under the same umbrella so that they can collaborate using the same tools irrespective of whether the work is open yet.

timr · on May 14, 2014

"There are other features like the issue tracker, pull requests, editing files and committing from the browser, etc."

Yes, bitbucket also has issue tracking and pull requests. I have no idea if they have file editing from the browser (though they probably do), because I'm a programmer, and I have an editor.

I know this is hard to believe, but Github really is a commodity product. There are many alternatives, and they all work pretty much the same -- particularly if you know how to use git from the command line.

bachmeier · on May 15, 2014

"I have no idea if they have file editing from the browser (though they probably do)"

They do. It's turned out to be an awesome feature for me as an org-mode user (so I obviously have an editor). You can add to org-mode files from any web browser and then it's on all your machines with the next git pull.

sampo · on May 14, 2014

I use the side-by-side diff view feature in Bitbucket a lot, and I am kind of amazed Github does not have that.

danso · on May 14, 2014

"I know this is hard to believe, but Facebook really is a commodity product, and Friendster and Myspace have been doing it since years ago, and Diaspora is going to do the same thing, except be private and secure" - said someone everytime FB added a controversial feature.

There's definitely buy-in and momentum of established products...but sometimes, products in the same space can be differentiated in ways not directly related to their core. Github makes it substantially easier to share and collaborate. And that is why many people prefer it. If you reduce it to "Well other places have git, too"...then that's missing the point of why Github appeals to so many.

timr · on May 14, 2014

"Github makes it substantially easier to share and collaborate."

Unless you're using git, in which case, it's exactly the same: unlike facebook, I don't need my best friends to be in my git repo to make it useful.

In any case, I'm sure there are some folks for whom pull requests or browser-based editing (again: not unique to github) make the difference. But for a team of developers who need private repositories, it's six of one, half a dozen of the other.

isbadawi · on May 14, 2014

I wasn't implying that these features were unique to GitHub, just that there can be reasons why one might prefer to use GitHub beyond it just being a place to host git repositories.

I don't think "hosted git" accurately captures the appeal of GitHub (or bitbucket for that matter). While this is the main premise behind both products, I don't think they're necessarily interchangeable in the way you describe.

tonylampada · on May 15, 2014

And then there's gitlab.com with free unlimited private repos. I guess people just didn't find out about it yet.

maccard · on May 15, 2014

Which you have to host yourself. Not the same thing.

bradgessler · on May 14, 2014

... meanwhile Github doesn't support viewing rendered SVG files when you click on a .svg file, yet they have viewers for 3D objects and maps that render in SVG.

It's frustrating how difficult it is to get anybody at Github do work on small things since there's "no manager" and everybody wants to focus on grandiose projects.

jahewson · on May 14, 2014

Browsers are really not good at rendering SVG. Both WebKit and Gecko are riddled with bugs once you start using the non-trivial features of SVG. Worse still, browsers don't even support the latest version of SVG, which suffered a fate similar to ECMAScript 4. However, vector illustrating programs will happily generate files which use the browser-incompatible newer versions.

So while it may be possible to target SVG as a backend for rendering basic drawings, displaying arbitrary SVG files is, in practice, a lost cause.

TTPrograms · on May 15, 2014

They don't have to send the straight svg for preview. They could either modify it for compatibility or at least render it server-side to raster and send it over. It's not impossible, at any rate.

nbody · on May 15, 2014

It's not only a matter of compatibility but a multitude of security issues as well.

santaclaus · on May 15, 2014

Lackluster SVG support doesn't stop with browsers. Even Adobe Illustrator manages to royally botch SVG support -- try anything with transparency.

privong · on May 14, 2014

I like that the DOI association is with a tagged release, making it easy to identify specific versions of the repository. In principle, this will make duplication/checking of research results easier, as one can ensure use of the same version of the software as the original work.

hrjet · on May 14, 2014

Tags can be moved in Git. How much you want to trust the tags depends upon your use-case.

trurl42 · on May 14, 2014

The DOI is linked to a zip archive on zendodo / figshare, which will not change, even if you change the tag.

sampo · on May 14, 2014

So if you publish the DOI as a part of a paper, and then later improve the code in the Github repo (but of course not in the archived snapshot), is there some way that the people who follow the DOI could also find the improvements?

amethyst · on May 14, 2014

If you sign your tags with your GPG key, and only rely on tags signed by GPG keys that you trust, then you can trust the tag as much as you trust the holder of the key.

btn · on May 14, 2014

Something they don't mention until after you've signed up: the micro plan only lasts for two years. I assume any private repositories will become locked if you don't pay for a subscription after that (as with regular accounts).

In comparison with BitBucket (not to advocate, but they offer a comparable service): the restrictions they waive for academic accounts are done so permanently.

rxdazn · on May 15, 2014

You can renew it after two years though (still free of charge).

btn · on May 15, 2014

Is that spelt out somewhere? The only mention of it I can find is in the confirmation email they sent: "will be free for the next two years".

rxdazn · on May 15, 2014

I don't think so. I just filled the form and my request was approved (they did check my school email).

diego898 · on May 14, 2014

Im very excited that github is pushing forward with this. As a graduate student Ive been desperately trying to get my lab to switch over and use github as opposed to myfile.m --> myfile_diego.m --> myfile_diego_changed.m etc!

I want researchers to use it not just for code, but for latex files for paper writing as well!

anonymousDan · on May 15, 2014

Our department recentlyvset up a self hosted gitlab. Works great. As many private repos as you like and you don't have to worry about github disappearing off the map.

Xylakant · on May 15, 2014

Seriously, the chances that github just disappears off the map are about the same that the gitlab team decides to stop working on the OS product. Non-zero, but still low. However, the impact is only mediocre: It's simple to keep a copy of the repo around and just a simple to keep a copy of the wiki around: They're git repos too. The only thing you'd loose is the issue history and your network of clones.

If that's what you're concerned about I'd be more afraid of your internal gitlab server crashing down in flames because somebody fat-fingered a critical update.

danpat · on May 14, 2014

Reminds me of the classic:

http://thedailywtf.com/Comments/The_Best-est_Version_Control...

onalark · on May 14, 2014

This is very good news for digital science, and has been a long time coming. For those wondering why this is new and different from previous efforts, I will try to summarize.

Up until now, there have been many places to permanently store your public scientific research objects, so long as you were willing to use a Creative Commons license. Similarly, there were many places to freely host your software, so long as you didn't need a permanent store for your scientific code.

The GitHub-ZENODO partnership is the first web application to bridge these two needs. An earlier attempt with Figshare almost got it right, but failed to support code licenses (everything at Figshare is CC BY). This isn't the only way to host or cite your code, but it's definitely a good default.

hueving · on May 15, 2014

Most universities already have systems to host research long-term. Most researchers are just too lazy to use it. Also, I would place more money on a university being around longer than Github if I was worried about longevity.

onalark · on May 15, 2014

I agree that University archives present another opportunity, particularly among the "first-tier" research institutions in the US. One problem is that many University archives are paywalled or require restrictive licenses. Another is that these archives generally do not accept submissions from outside the University.

Let me turn this around on you. Given the availability of University preprint servers, what's the value of a service like arxiv?

dnautics · on May 14, 2014

I'm not so sure about allowing for private research repos. As a former scientist, incentivising siloization seems to be the wrong direction; if they want it, they should pay for it. A normal micro plan is cheap enough that it won't break the wallet of a researcher.

eob · on May 14, 2014

As long as society continues to reward individuals for generating new research, individuals will, out of necessity, keep their research materials private until they have finished extracting personal gain from it. Otherwise they would go through all the pain of working on a problem only to have someone swoop in 10% before it's complete and cross the finish line holding a borrowed baton.

dnautics · on May 14, 2014

If it's done in the open, it's pretty trivial to figure that out and the person who swooped in runs the risk of being shamed. The few egregious cases of this I can think of (Leo paquette and Armando Cordova) only were able to operate because of siloed work and secrecy; and still they got exposed.

fred_durst · on May 14, 2014

> only to have someone swoop in 10% before it's complete and cross the finish line holding a borrowed baton

Indeed, Github can't be expected to, nor is it capable of, fixing one of the larger flaws in neoclassical economics.

ISL · on May 14, 2014

Open access to your scratch work while you're working is a ticket to getting scooped or to having people form misinformed and incomplete perspectives of what you're working on.

Private repos are the only way that my group is willing to use GitHub. Upcoming federal requirements may force us to divulge all of our code in the future. That's a very hard sell to the professors.

jcrites · on May 15, 2014

> Upcoming federal requirements may force us to divulge all of our code in the future. That's a very hard sell to the professors.

Do you mean before or after publishing the paper? If it's after / as part of publishing the paper, what makes it a hard sell?

ISL · on May 15, 2014

There's a big difference between publishing a paper, where, in some cases, every character has been vetted multiple times by every author, and opening your entire codebase and raw data to the world. Fully documenting and characterizing everything for external consumption adds a large burden to the task of publishing. It may be worthwhile, and perhaps even essential, but it requires a lot of effort that could be directed toward further research.

From a social perspective, opening up your data gives lots of angles of attack for others. (See the difference in interpretation of gamma rays from the galactic center from Fermi/LAT between the telescope collaboration and outsiders, for a topical example). Again, it may be healthy to open everything, but if you've spent 5-10 years trying to build an unassailable measurement, handing out every last bit of dirty laundry to your critics can be daunting. Science ultimately reaches the truth, but it's a lot easier to get tenure if your measurement isn't controversial.

camus2 · on May 15, 2014

then pay for it.Why should developpers pay for it and not scientists?

ISL · on May 15, 2014

We do.

diego898 · on May 14, 2014

Im not sure I agree with this, especially when you want to get your lab to switch at first. The fact that its free is a huge bonus to get my lab to "just try it"

dnautics · on May 14, 2014

Ah, good point.

ylem · on May 15, 2014

As a current scientist, I have mixed feelings. I'm a neutron scatterer and at an increasing number of facilities, our raw data is available to the public as we take it (or after an embargo period) unless it's proprietary (that is, someone like GM is paying to use the facility). Personally, I think this is a good thing if someone needs to check a result later. It's also convenient for accessing data after an experiment (we have a model where scientists go to different facilities to perform experiments).

I do try to put code out in public repositories, but I will say that often what stops me isn't a question of a competitive edge, but rather the fact that my code is ugly. For example, I have developed a small code (relatively slow) that calculates an instrument resolution. I wrote it back when I was first learning python and it is truly ugly and there are a number of things I would do differently today if I had more time. Recently, someone from another facility asked for it and I had some reluctance because of that ugliness. Eventually, I just made a git repository and pointed them at it (along with an offer to help if they got stuck) because I realized that otherwise it would never get out.

However, I have met people who refuse to share their code and figure that it's an easy way for them to get added to a paper...

BUT--I will say that while I'm working on a paper, it's an entirely different story. I wouldn't want to put code that's being used to reduce a specific set of data (along with the data) out in the wild before the paper was accepted. I would say the same thing for drafts of the actual paper, grant applications, etc. That would seem like asking to get scooped.

As for the question of it breaking the bank--I do pay for private repos, but if you want to change the practice of people just swapping files back and forth, then the barrier to entry needs to be extremely low....

On a final note, I will say that my facility does develop code for general use and we do put it out on github/googlecode, but that's different than "adhoc" code...

dnautics · on May 15, 2014

Why are you so afraid of getting scooped? You put your data out there on a repository. A third party has managed time stamps on your data. Jerkface 'scoops' you. Then you complain, prove that you didn't steal it from me. They don't have their data in a public repository. They lose.

epistasis · on May 14, 2014

While keeping everything public is probably good for bioinformatics projects that focus just on building tools, it would be disastrous for other areas of research where revealing a particular gene under investigation or specific genomic locus may result in loss of credit. This problem must be fixed at the publication, CV, and tenure committee level, not by GitHub.

dnautics · on May 14, 2014

well, I'm not a bioinformaticist (as a matter of fact my personal intent is to develop pharmaceuticals in the public), but I do see it as a chicken and egg problem, in other words, it's possible for github to become an instrument for that change.

cannam · on May 14, 2014

Does it continue to be free after you stop being an academic?

gertptr · on May 14, 2014

I applied for the discount and they asked for my graduation date. I presume that they will remove the discount after the given year. Not sure about the permanent positions, though.

Edit: Github approved my discount and gave me a coupon for 2 years (even though I will graduate later than that).

sixbrx · on May 14, 2014

I don't think their Markdown flavor supports any Latex extensions, that would be a good start, IMO.

onalark · on May 14, 2014

It used to, but MathJaX does not play well with GitHub's Markdown parser.

alceufc · on May 14, 2014

Is there any style or guidelines for citing code (e.g. as we have APA for citing papers)?

Usually when I want to give credit to a software or code that I have used I cite the paper that describes the that software or code.

Another point: if you attribute a DOI to your code, that could result in it "stealing" the citations from a paper that describe that code.

mandalar12 · on May 14, 2014

Your latest point can be resolved through explicitly asking authors to cite the paper rather than linking to the website/github repo.

For instance I used Intel Pintools and they specify in their FAQ[1] : I used Pin for my latest paper. What citation should I include? [actual paper here]

[1]: https://software.intel.com/en-us/articles/pintool-faq

gone35 · on May 14, 2014

Yes; plenty of them [1,2,3,4,5,6,7].

[1] http://writers.stackexchange.com/questions/7797/what-is-the-...

[2] http://integrity.mit.edu/writing-code

[3] http://software.ac.uk/so-exactly-what-software-did-you-use

[4] http://uark.libguides.com/content.php?pid=155080&sid=1780817

[5] https://owl.english.purdue.edu/owl/resource/560/10/

[6] https://owl.english.purdue.edu/owl/resource/560/25/

[7] And many, many more at: http://lmgtfy.com/?q=citing+source+code. I apologize for the snark, but it is a not-too-abstruse and rather common question that is very easy to figure out online without any special or prior expertise; and so I think contributes very little to the discussion.

ylem · on May 15, 2014

This is actually rather cool! There has been a movement in some fields towards reproducible research (http://reproducibleresearch.net/). If people were required to provide source code in their publications (by funding agencies), it would be rather useful for those cases where there are questions about the data analysis (for example: http://en.wikipedia.org/wiki/Anil_Potti). It would allow a serious referee to check to see if there were either obvious or subtle mistakes in the analysis. Also, it would allow subsequent researchers to see what previous researchers actually did in the reduction--not just the final figure.

cannam · on May 14, 2014

This seems like quite a big deal -- the parts were in place already, but having a route promoted by Github will probably make a big difference in academia. The idea of having a citable DOI for your work exerts a magic that is possibly out of proportion to its practical advantages for research code.

I imagine the private project thing may be a blow to Bitbucket, which is quite widely used by academics.

The pessimistic part of me finds it quite discouraging that this will only centralise things more at Github. The optimistic part hopes for more matter-of-course publication of research code and a happier relationship between researchers and the software they write.

(Disclaimer: I run a subject-specific academic code repository, http://code.soundsoftware.ac.uk)

onalark · on May 14, 2014

And it's a very nice site! I might suggest making it more clear one of your help or about pages what licenses are acceptable for hosting. I am not a UK academic or a sound analysis computational scientist, but it's something that I would be concerned about when choosing a place to store my code.

dalek2point3 · on May 14, 2014

I've been thinking about this for a while, and thought github finally understood the value of a "git for data". Github really needs to be data repository, with "Code" being only one type of data. But then they become a hosting service for large data files, and that is a headache and a different business. I dont see how academics who work with reasonably large datasets would move to github as long as they have to manage their data on another platform. Perhaps Github could have deep integration with AWS / DO and collaborate to solve that pain point.

hyperion2010 · on May 14, 2014

I have been working on a little python cli project electrophysiology. I will write a paper about it but it is such a simple and small project that many journals may not want to publish it, however it could be really useful for other experimentalists. I personally don't care too much about giving credit, but I want people to be able to clearly communicate what tools they used to do their science and citing github repos or tags is not really done at the moment. Kudos to all involved for getting this to work!

onalark · on May 15, 2014

This is a perfect fit for the SciPy conference. There are also several open journals that focus specifically on publishing openly licensed software used in academic work.

Citing software really varies by community. It is very consistently done in some communities, and is gaining broader acceptance in others, some with surprising rapidity (ecology comes to mind).

It's an exciting time to be a scientist!

ISL · on May 14, 2014

Anyone else having trouble verifying an existing primary .edu address (no "Verify" button is apparent)?

sp332 · on May 14, 2014

Have you already verified it?

ISL · on May 14, 2014

Checks old email... yes (thanks :) ).

dllthomas · on May 14, 2014

So that's "Improving (GitHub for science)", not "(Improving GitHub) FOR SCIENCE!".