What information is needed to reuse code? #2
Comments
A few off-the-cuff thoughts: at least minimal automated tests, so we can be sure the software runs properly; @mr-c, what are we missing? |
A quick one that I see missing from lots of research code (and that sometimes keeps me from using it):
|
shameless plug:
|
These help users make a decision, and benefit the authors, since they reduce negative feedback from people expecting the software to do something it's not designed for. |
I know this is more from the perspective of someone who wants to publish code, but I'm often missing a good tool that generates all this boilerplate and sets up the distribution of the software. Being mostly a Python dev, I'm still highly annoyed when I need to publish Python code, even after all this time. I'm much more impressed by languages like Julia, which has a built-in mechanism to generate a package in minutes (set up GitHub unit tests, performance tests, register the name, and set up docs in less than a minute), and by npm (JavaScript). Then, as a "user" of that code, I shouldn't need to know how to install the package; it should "just work". The same is true as a code author: if I can't use the language's own way of installing things on other computers in my lab in less than 10 minutes, I won't even try to make it work across computers. This is (for me) more than 80% of what is needed to reuse code:
|
Thanks, @ctb! My only concern: for researchers who don't identify as "computational scientists" and may be doing something slightly more entry-level, is that too onerous? I'm thinking of the bare minimum someone needs to be able to get up and running with the code ... |
Great stuff, all. Keep it coming. :) |
I think it's important to distinguish short-term goals (What can we do right now? What should we recommend as best practices?) from long-term goals (In which direction should we develop our infrastructure?). In between we have the category of "thin layers on top of our current toolstack that would make life easier". The comments I see are about short-term and mid-term tasks, so I'll tackle the long-term directions.

The number one long-term goal for me is a stable code representation for scientific software. This can't be source code: programming languages are a user interface, which we want to keep improving and adapting to our domain-specific needs, so we will always have multiple and evolving languages. Machine-level code can't be stable either, because hardware evolves as well. So a stable layer must sit somewhere in the middle, at a level that no one cares about enough to reject it. JVM or CLR bytecode are at the right level, but not so well suited to scientific applications. Why do we need a stable code representation? For two reasons:
|
Great point, @khinsen. For this, we're thinking more of the 5 fields you fill in to go along with your code as you push a release to, say, figshare, so that someone can meaningfully glean (without much pain) what the code does, how to run it, etc. in minimal time. There's of course a longer-term play here, but styling this after some of the standards listed in the blog post, we're looking at a) whittling it down to the basics for this first instance (they're easier to implement, with higher fill rates) and b) serving those who do not necessarily identify as "computational researchers". Stellar points. |
[edit - this is all with regard to the building, distributing and running part] Docker would be an obvious candidate for this kind of thing: https://www.docker.io/ A public repository with a Dockerfile allows people to build the software, and you can distribute built images easily and efficiently. It would reduce the explanation of how to run the system to a single `docker run` command.
All dependencies would be contained in the image, which could run optimised BLAS libraries or python or whatever they want (as long as it works on linux). Converting something that already runs into a docker image is as simple as providing a build script (which should read pretty much like a good README setup section), no changing language or anything like that. If the original researcher uses this to manage their dependencies & build process, then you also know it will actually run. Reviewers could also easily run the software without fighting build systems. |
@IanCal For me this is reproducibility, not reuse. If you have two projects, each with a docker image (A and B), how do you use A in docker image B? I agree that it might be helpful to have something like that available, but it should be an extra, not the main point. |
@Carreau There's nothing stopping you from reusing the code within it. You'd publish your code, a script for building it, and a fully contained runnable system. This means you know all the dependencies and you know it'll build. Many things may benefit from being structured such that they communicate over the network, in which case you can very simply link your docker container and theirs. This allows you to have isolated dependencies, so you're not stuck because you both require different, conflicting dependencies. |
Here's a diagram of a subset of EML metadata proposed back around 2000 for software for reuse on eco and enviro data: |
It may help to consider what sort of endpoint could be achieved by truly re-usable code. To my mind, a good outcome would be a standardised system that could build complex software projects for you out of combinations of other software projects available in a unified, re-usable format. There are a few examples of this type of thing that already exist:
- Automatic installation systems for *nix systems, e.g. MacPorts, HomeBrew, Yum, Apt-Get. There is a standardised 'make' format and a standardised metadata format for storing your project in a repository, which allows these systems to search for and install/build/utilise disparate and non-uniform projects.
- Web documents and browsers. Each browser ought to be able to interpret the structure, layout and information present in billions of disparate projects. This is done by separating information from metadata and layout. Tags like 'document type definition' could be very useful for specifying programming language and version for non-web-based projects.
- The pipelining software Galaxy (galaxyproject.org), or other variants such as Taverna. I've been using Galaxy to make my disparate conglomerate of R, Python and Matlab scripts and tools more accessible to the biologists in my lab. Each tool is 'wrapped' with an XML file that describes all possible inputs, outputs and command line instructions. By putting this uniform descriptor on each uniquely designed project, they can be lassoed together into one large complex project.
I honestly think that something like Galaxy is very close to where you want to go with this, but the next step would be to design a universal standard for this type of metadata wrapper system so that a) it is optimised/advanced and b) anyone can design a Galaxy-style system for pulling projects together, without necessarily having to be a Galaxy expert first. Finally, while that system is useful for merging command-line-callable standalone tools, I think it would ultimately be possible to have a system that built tools for you by wrapping libraries and packages in this way, along with 'include and calling methods' rather than 'command line instructions'. |
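To make the wrapper idea above concrete, here is a rough sketch of a descriptor-driven command-line wrapper. The descriptor format, tool name and flags are invented for illustration and are not Galaxy's actual XML schema:

```python
# Sketch of a descriptor-driven command-line wrapper, in the spirit of
# Galaxy tool wrappers. The descriptor format, tool name and flags are
# hypothetical; Galaxy's real wrappers are XML and far richer than this.
import subprocess

DESCRIPTOR = {
    "name": "align_reads",          # hypothetical tool
    "command": "aligner",           # executable assumed to be on PATH (illustrative)
    "inputs": {
        "reads": {"flag": "--reads"},
        "reference": {"flag": "--ref"},
    },
    "outputs": {
        "alignment": {"flag": "--out"},
    },
}

def build_command(descriptor, inputs, outputs):
    """Turn a descriptor plus concrete file names into an argv list."""
    argv = [descriptor["command"]]
    for name, spec in descriptor["inputs"].items():
        argv += [spec["flag"], str(inputs[name])]
    for name, spec in descriptor["outputs"].items():
        argv += [spec["flag"], str(outputs[name])]
    return argv

if __name__ == "__main__":
    cmd = build_command(DESCRIPTOR,
                        inputs={"reads": "sample.fastq", "reference": "genome.fa"},
                        outputs={"alignment": "aligned.bam"})
    print("would run:", " ".join(cmd))
    # subprocess.run(cmd, check=True)  # enable once the wrapped tool actually exists
```

Any pipeline system that understands the descriptor format can then call otherwise disparate tools uniformly, which is the point the comment above makes about Galaxy-style wrappers.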
@kaythaney I think we could use a bit more guidance here as to what problems this question seeks to address. I think you're asking what we should consider the minimal metadata fields for scientific software (e.g. title, version, maintainers, etc.). Other things like code documentation, unit testing, functionalized code, a good API, and knowledge of the algorithms and their limitations can all be necessary for reuse, but aren't necessarily things we can treat as software metadata.

Regarding minimal metadata, I think it's instructive to consider what various software packaging systems have decided should be minimal metadata: e.g. R's DESCRIPTION files on CRAN, Perl's META.yml files on CPAN, Ruby's gemspec files. While I don't think any of these provide an exhaustive list of what is needed to "re-use" the software, each has been developed and tested within its own ecosystem to be a reasonably reliable source of information to (a) facilitate installation by handling dependencies, (b) provide enough information for users and developers to search for desired features (like "parse xml") within a repository, and (c) usually provide some indication of how to get documentation and/or support for the software. Notably, none of these systems try to answer "can I re-use this software?" or "can I trust the results?" (questions best left to the user community to evaluate), and yet people build upon these systems all the time.

So if Debian etc. have effectively answered what metadata they expect from package providers to help promote re-usable software on their platform, I suspect the question is: what are the metadata elements (and format?) to promote reusable software in scientific research? It seems to me that part of the answer is simply to meet whatever the platform- or system-specific standards are for software distribution (R software should use the package system with a valid DESCRIPTION file, provided on CRAN, etc.). Given that, we might then ask whether there are elements missing from any of these existing standards that we would expect to be part of the minimal metadata for a scientific package (a DOI? Citation(s)? Particular licenses? Bug tracker/mailing list/maintainer contact? Description? Keywords?). What fields should be optional, and what required?

p.p.s. A related but different question is whether or not there should be a standard metadata format for scientific software, such as one might collect into a central searchable repository akin to the language-specific ones. Personally, I think success in doing so is hard and we should instead build on the language-specific repositories, but I could be convinced otherwise. |
I think there needs to be a philosophical shift amongst scientists who code for their research as well. It's been discussed that the largest reasons code isn't shared are that authors feel their code isn't clean enough to share, the code has poor documentation, researchers don't want to support their code when others have trouble with it, and authors don't want their code scrutinized if it could invalidate their results (see Shamir, L. et al., Astronomy and Computing 1 (2013) 54-58, http://dx.doi.org/10.1016/j.ascom.2013.04.001). These are all valid concerns, but if 87% of authors aren't publishing their code, these reasons must be overcome. I'm not sure what that will look like. Obviously, if code is horrendous and poor programming results in bad data, it could lead to paper amendments or retractions. But at the same time we have to make clear to authors that they don't need to maintain their code for years and years with installation and usage support or bugfixes/enhancements over time. The purpose of code sharing from a scientific point of view is to allow other people to run the data through the same software and get the same result, as a validation of the work that was done. Code should be published so that new data can be compared to the old to support or refute hypotheses. If that means that code is attached as an addition to the supporting information of a paper, and nothing more, then so be it. While the end goal may be a centralized repository with one-button setup and execution that anyone and everyone could use (as discussed well by @bobbledavidson, @IanCal, and others above), that won't happen without first accepting the sharing of code in whatever ugly, unsupported fashion it may appear. |
@pbulsink Great point. I think it is worth distinguishing between what researchers release as "software" with the intent of reuse (usually described in dedicated "software papers"), vs. code snippets and scripts that just document what the researchers actually did in a particular publication. I was thinking only of the former case. The very same concerns arose during the first phase of the Mozilla Code Review project (e.g. see http://carlboettiger.info/2013/09/25/mozilla-software-review.html). I agree entirely with your perspective that the first step is simply getting people to publish the code or scripts they used, regardless of what they look like (as eloquently argued by Nick Barnes in "Publish your computer code: it is good enough", and by Ince et al., who provide a damning critique of why pseudocode isn't enough in "The case for open computer programs"). Perhaps it is silly to try to distinguish between 'software intended for reuse' and 'code as supplemental methods documentation', I don't know. At the moment I'm in favor of treating them as different concepts and holding them to different standards. Still, providing simple guidelines on how to increase the reusability of code that is otherwise ugly, un-abstracted, and intended only to show what a particular author did to get a particular result is not necessarily a bad thing, and need not discourage others from sharing in whatever way they see fit. Here, simple practices such as declaring dependencies with versions, dating the script, and providing it as a text file instead of a PDF image (yes, I've reviewed more than one paper in which code was included as a PDF file) could go a long way, as in the sketch below. |
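A hedged sketch of what those simple practices might look like at the top of such a script; the file name, date, dependency versions and license below are invented for illustration:

```python
#!/usr/bin/env python
# analysis_figure2.py -- regenerates Figure 2 of the paper (hypothetical example).
# Date: 2014-04-15
# Tested with: Python 2.7.6, numpy 1.8.1, matplotlib 1.3.1
#   (matplotlib is used in the plotting section, elided here).
# License: MIT. Shared as supplemental methods; not actively supported.
import sys
import numpy

def report_environment():
    """Print the versions actually in use, so readers can compare with the header."""
    print("python %s, numpy %s" % (sys.version.split()[0], numpy.__version__))

def main():
    report_environment()
    # ... the actual analysis would go here ...

if __name__ == "__main__":
    main()
```

Even this much lets a later reader know when the script was written, what it depends on, and what they are allowed to do with it.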
If we are prioritizing these needs, I'd say "what exactly does the code do" is the hardest challenge, especially for leveraging the plenty of legacy code out there. This includes the short description and some meaningful documentation (both of which tend to remain really short, or sit as placeholders for a future time after all other work is done; that time never comes). For example, I find tons of useful code but little to no documentation, no details on which implementation was used, and no examples of where the code was used. If I have to spend a significant amount of time understanding the ins and outs of code, I'm better off starting from scratch. My most recent research coding (just to distinguish from all the other coding I do) suffered from this exact problem: I found many statisticians had written some implementation of the Lomb-Scargle periodogram, but never in a form that I could easily reuse (the sketch after this comment shows the kind of minimal documentation that would have helped). License: often there is none. But this is one thing that can be solved with better training for scientists.
Installation: this is a challenge, but it can go either way. Depending on the software and the packaging system, it can be really easy or super painful. But finding that out doesn't take much of a time investment (it's easy to spot the ones that are hard to use and move on). One more that's not on the list. |
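To illustrate the documentation point with the Lomb-Scargle example: a sketch of the kind of small, documented wrapper that would have made such code easy to reuse, here built on SciPy's `lombscargle` (the function name and example values are illustrative, not any particular statistician's implementation):

```python
# Sketch: a small, documented, reusable Lomb-Scargle wrapper around
# scipy.signal.lombscargle. Illustrative only.
import numpy as np
from scipy.signal import lombscargle

def periodogram(times, values, freqs):
    """Compute a Lomb-Scargle periodogram for unevenly sampled data.

    Parameters
    ----------
    times : array of sample times (arbitrary units)
    values : array of observations at those times
    freqs : array of *angular* frequencies to evaluate (rad / time unit)

    Returns
    -------
    array of periodogram power at each requested frequency
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # Subtract the mean so a constant offset does not dominate the power.
    return lombscargle(times, values - values.mean(), np.asarray(freqs, dtype=float))

if __name__ == "__main__":
    # Tiny usage example: recover a 0.5 Hz signal from irregular samples.
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 20, 200))
    y = np.sin(2 * np.pi * 0.5 * t)
    w = np.linspace(0.1, 10, 500)           # angular frequencies to scan
    power = periodogram(t, y, w)
    print("peak at ~%.2f Hz" % (w[power.argmax()] / (2 * np.pi)))
```

The docstring, the note about angular frequencies, and the two-line usage example are exactly the "ins and outs" that are usually missing.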
@kaythaney, trying to get back to something concrete -- if these are blockers,
then you are not ready to write code for yourself to use, much less anyone else. I'm even willing to equivocate on #1 ;). (Yes, I removed the Software Sustainability Inst stuff -- too vague) |
Building something other people can download, install, understand, and use is roughly 3X the effort of building something that works for you on your machine [1]. What's the incentive for the working scientist to put hours (days, weeks, ...) into reusability instead of using their software to produce another publishable result themselves? |
👏 to what @gvwilson said. But is this particular discussion about incentives? |
Well said, @karthik. Without detracting from the importance of incentives, it's worth knowing just what needs to be incentivised. "Reuse" is too vague. Nick Barnes and @pbulsink argue persuasively that it is just the publishing of whatever scripts the authors used. Others on this thread have argued just as persuasively that a lot more effort than that is needed for something to be really reusable. Only in very few specialized cases have publishers offered clear guidance (e.g. the Journal of Open Research Software, which has just a few simple additions beyond @ctb's list). Without any guidelines, even well-meaning folks will share code whose reuse is hampered by things that take minutes, not hours to weeks, to fix. Setting the bar too high will only cause trouble. I believe there would be great value in a community-consensus middle ground. Consider this script as a typical-to-above-average example of things I see in my field where someone has bothered to share code. Using the imperfect criterion that the publication mentioned is indeed a valid example of both the intended purpose and intended output, I believe this could arguably meet the journal's criteria (quoted below) for software merely by (a) moving it to an established repository and (b) adding a license.
I'm not saying these are the right criteria, or that these criteria make this reproducible; that's all good stuff to argue over. I'm only trying to illustrate what a middle ground between the 3X effort and "publish your code, it's good enough" might look like. (Noting too that criteria may differ for different types of code, e.g. software vs. snippets like the one above, and potentially also between languages, which face different challenges in certain issues like cross-platform compatibility.) |
Regarding @gvwilson's and @karthik's comments about incentivising scientific programmers to add in these extra levels of work: I'd like to state that while many programmers would like to make a perfect code snippet for re-use by all and sundry, many will feel that they do not have the time. I am often asked to 'just make up some code' to get something working or to try something out, but I'm never encouraged to develop that into a proper tool because the 'effort to reward ratio' is too low, or so I'm told. If re-usable snippets of code could be counted as examples of successful outputs in government research funding proposals etc., then the 'effort to reward' ratio would shift and I'd be encouraged by my boss(es) to add the test data and metadata and to make my work public. With regard to @pbulsink's thoughts on encouraging people to share their work: I think that having a standardised format would actually help convince people to share their scrappy, unsupported work, IF there was a basic level that didn't require test data, support etc. A lot of programmers who know they haven't put enough effort into their code do not want to show it as an example of their work to the community. But if there were different levels of release, in the same way that there are different types of GNU license, CC license etc., then someone could present their work in the "don't blame me, this is just a quickie" category and not expect any negative comeback. For example, if there was a standard metafile format for sharing snippets of code (or whole projects) that at the lowest level only required basic details such as a title, then authors could happily add this tiny file to their e.g. GitHub repository and make the code public without worrying about being asked for help all the time (by saying 'not supported') or worrying that someone will think they do shoddy work (by stating that it's a low-level release). |
Code discovery and reuse would be aided by implementing standardised software metadata descriptor files. Human-readable descriptor files could be developed alongside machine-readable ones: machine-readable descriptors would aid discovery of relevant code, while human-readable descriptors make it possible to evaluate the search results for relevance. If people can find relevant code easily, they are more likely to reuse it. The contents of the descriptor files would need to be agreed by community consensus; could this be a viable goal for this group? |
|
Admittedly, the last few decades have not shown this type of coordinated uptake of standards, but the last two decades have seen programming methodology change: first with Java and object-oriented programming, then with all the advances in web technology, the realisation of ubiquitous computing, and now 'big data' in all its forms. I think that with the likes of GitHub and other 'social' code-sharing initiatives, alongside the general acceptance of open source, open access and open data, even at government level, we have a better chance now than we did over the last few decades. Personally I'm inclined to think that all it would take is to present a 'version 1.0' of some schema for how to implement this and then to open it up for people to take it up, reject it, or feed back; the ground is fertile for this type of thing to take off once the seed has been planted. |
I'd advocate "ask for forgiveness, not permission" here. I doubt a consensus will be reached. Rather than asking people to come to an agreement on what to do, if mozilla were to pick something reasonable (like the list provided by @bobbledavidson) and a format (yaml, json, anything but xml ;) ) and run with it we can find out what the actual problems are. Simple proposal, specify the name and format of the file and start publicly listing all github repos that have the file. People will start adding it because that's the way of getting on the list, then because it's being used we'll find out what to change. People will still argue about it, but they'll be arguing while an actual implementation exists rather than having nothing. If there's a name for the file and a format, I'll add it to my code today. |
I completely agree with @IanCal. We should just come up with something minimalist but expandable and then present it as e.g. Mozilla Reusable Code Object Standard v1.0 and start making use of it. Then we can have feedback sessions and further congresses to develop the version upgrades and stratifications etc. |
+1 for something is better than nothing. |
Sorry for jumping in late to the conversation - there have been some very good comments so far. Here's my pragmatic view (personal opinion, may not be shared by others at the SSI!). ABSOLUTE minimum:
USEFUL minimum:
PRAGMATIC TO STRIVE FOR minimum:
IDEALISTIC minimum:
There's already a lot of good work on this. For the idealistic end, the NASA Reuse Readiness Levels are a good read. You might be interested in the description criteria we've used for the Journal of Open Research Software as well: http://openresearchsoftware.metajnl.com/about/editorialPolicies#peerReviewProcess - these are somewhere between my description of USEFUL and PRAGMATIC above. I do like the @bobbledavidson list - my suggestion is that people should use the CRAPL (http://matt.might.net/articles/crapl/) for the "don't blame me" license :-) One thing though, for the categories of "actively developed or not" - my heart says this should be identified through the repository stats rather than the metadata file, though my head says that given the findings of http://firstmonday.org/ojs/index.php/fm/article/view/1477/1392 maybe it is the original author who decides this category. |
+1 to @npch's comments. I don't disagree with your point about testing, @ctb, but for researchers who are doing a bit of data analysis and aren't accustomed to testing, we still want to nudge them in the right direction, even if it's not perfect practice out of the gate. (And yes, I agree that without testing it's not always easy or advisable to reuse code, but well, we're not going to change everything overnight ... ;) ) |
If the dataset can be generated synthetically with a fixed seed, one can avoid shipping potentially large files (with the accompanying archival/infrastructure issues). For example, see how we use https://github.com/SciLifeLab/facs/blob/master/tests/test_simngs.py#L39
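A generic sketch of that pattern (not the simNGS-based test linked above): the test regenerates a small synthetic dataset from a fixed seed instead of shipping a data file; names and sizes are illustrative.

```python
# Sketch: regenerate a test dataset deterministically rather than shipping it.
# Generic illustration only; not the simNGS-based test linked above.
import hashlib
import numpy as np

def make_dataset(seed=42, n=10000):
    """Return a reproducible synthetic dataset for testing."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=n)

def test_dataset_is_reproducible():
    # Two independent generations with the same seed are byte-identical,
    # so neither the test nor a downstream analysis needs a stored data file.
    first = hashlib.sha256(make_dataset().tobytes()).hexdigest()
    second = hashlib.sha256(make_dataset().tobytes()).hexdigest()
    assert first == second
    # In a real test suite you could pin the expected digest as a constant
    # to catch accidental changes to the generator or its parameters.

if __name__ == "__main__":
    test_dataset_is_reproducible()
    print("synthetic dataset regenerated identically")
```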
I'd like to see tooling that discovers dependencies and environment information, to capture easy guesses and autopopulate metadata fields. Author experience is going to vary wildly, and authors may not know the answers to what you want to include, nor have the time to investigate them. Sumatra is a project I see that attempts to solve this: https://pythonhosted.org/Sumatra/introduction.html
In addition to what has been discussed so far, Sumatra gathers information about the platform architecture the code is run on, and for supported languages (R, Python, MATLAB) it gathers dependencies. |
I really want to emphasize tooling. Authors are not going to be able to spend time chasing down all of this useful information. It is hard to get compliance with formal processes even in the software industry, from people who do this as a full-time job. Thus those of us who create tools should bake affordances into the tools, so that all of the information is captured as a side effect of use. Everyone should budget for UX development and testing in grant proposals. Reproducibility can be a side effect of usable design. |
@codersquid I agree fully with your call for better tooling support. But let's not forget that tools such as Sumatra don't perform miracles: they obtain metadata from the software packages themselves, and thus rely on conventions and techniques that make it possible to discover them. So we need not only tools but also conventions. Sumatra works because the Python ecosystem has informal de-facto conventions that most packages follow, such as the standard ways of exposing package names and version numbers. |
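As a rough illustration of the convention-based discovery such tools rely on (a sketch only, not how Sumatra actually works), standard Python packaging metadata already lets a few lines of code record the environment; the package names below are just examples:

```python
# Sketch: convention-based environment and dependency discovery of the kind
# tools like Sumatra depend on. Illustration only, not Sumatra's implementation.
import platform
import sys
from importlib import metadata

def environment_snapshot(packages=("numpy", "scipy")):
    """Collect platform info and versions of the named installed packages."""
    snapshot = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            snapshot["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot["packages"][name] = "not installed"
    return snapshot

if __name__ == "__main__":
    print(environment_snapshot())
```

If packages did not expose their names and versions in a standard place, none of this automatic capture would be possible, which is the conventions point above.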
Just two further thoughts on this thread. NSF's recent Dear Colleague letter includes ideas that are very much in this spirit of establishing metadata for code reuse, particularly along the lines of a study that might empirically demonstrate what elements contribute most to effective reuse; see http://www.nsf.gov/pubs/2014/nsf14059/nsf14059.jsp. It encourages applicants to apply for exploratory EAGER grants on this topic via the SciSIP or SI2 programs; I hope the Mozilla Science Lab and/or others on this thread might consider such an angle. In a different vein, I don't believe it's been mentioned on this thread yet, so I thought I might bring up the Science Code Manifesto by @drj11 and others at the Climate Code Foundation as a related perspective on this question of minimal metadata. I would describe it as somewhere between @npch's useful minimum and absolute minimum (archive in repository, state license, citation), though it is more of a cultural guideline than a technical one. |
A couple of thoughts. Folks might be interested in this lecture by @victoriastodden: "Toward Reproducible Computational Science: Reliability, Re-Use, and Readability" - http://www.ischool.berkeley.edu/newsandevents/events/20140409stodden (slides at http://www.stanford.edu/~vcs/talks/BerkeleyISchool-April92014-STODDEN.pdf ). There is also discussion going on at http://forum.mozillascience.org/t/what-information-is-needed-to-reuse-code/20 |
The test for sufficient metadata would be: what is needed to reuse the data? There might be easier questions that should be answered first. Finding the correct metadata to enable data use is a great discussion, but the connection between published interpretations, figures, and the applications (transformative algorithms) in Python, R, or whatever else must align. I believe this could be a faster approach to answering which metadata I need to record. I propose that we work on coordinating publications, data, and the applications that transform the raw data into figures, so that the whole chain is repeatable as code. Attacking individual problems with this holistic approach will produce a publishing platform that respects text, code, and data, as well as all their implicit and semantic relationships. Our best platforms consider two of these at best. Let's get serious and recognize that these languages and their relationships cannot be thought of separately! |
(For more, see the full post on the blog: http://mozillascience.org/what-else-is-needed-for-code-reuse/)
When we first started discussions around our latest "Code as a research object" project, one of the main topics that arose was reuse. It's one thing for code and software to have an identifier that the community trusts so that it can be integrated into scholarly publishing systems. But what about the researchers looking to use that information to build or reuse that code in their own work? What information is needed for the code to be picked up, forked and run by someone else outside of their lab? What would the ideal README look like?
A few ideas: