diff --git a/novice/git/04-open.md b/novice/git/04-open.md index abdaa2ca7..679f414dc 100644 --- a/novice/git/04-open.md +++ b/novice/git/04-open.md @@ -6,182 +6,80 @@ title: Open Science
#### Objectives -* Explain how the GNU General Public License (GPL) differs from most other open licenses. -* Explain the four kinds of restrictions that can be combined in a Creative Commons license. -* Correctly add licensing and citation information to a project repository. -* Outline options for hosting code and data and the pros and cons of each. +* Learn how to distribute open source software: + - Choosing an appropriate open source license + - Choosing an appropriate hosting repository +* Learn how to distribute open data: + - Understand licensing concerns for data and metadata + - Choosing an appropriate data repository
-
-The opposite of "open" isn't "closed". -The opposite of "open" is "broken". -
-— John Wilbanks -
- -Free sharing of information might be the ideal in science, -but the reality is often more complicated. -Normal practice today looks something like this: - -* A scientist collects some data and stores it on a machine - that is occasionally backed up by her department. -* She then writes or modifies a few small programs - (which also reside on her machine) - to analyze that data. -* Once she has some results, - she writes them up and submits her paper. - She might include her data---a growing number of journals require this---but - she probably doesn't include her code. -* Time passes. -* The journal sends her reviews written anonymously by a handful of other people in her field. - She revises her paper to satisfy them, - during which time she might also modify the scripts she wrote earlier, - and resubmits. -* More time passes. -* The paper is eventually published. - It might include a link to an online copy of her data, - but the paper itself will be behind a paywall: - only people who have personal or institutional access - will be able to read it. - -For a growing number of scientists, -though, -the process looks like this: - -* The data that the scientist collects is stored in an open access repository - like [figshare](http://figshare.com/) or [Dryad](http://datadryad.org/) - as soon as it's collected, - and given its own DOI. -* The scientist creates a new repository on GitHub to hold her work. -* As she does her analysis, - she pushes changes to her scripts - (and possibly some output files) - to that repository. - She also uses the repository for her paper; - that repository is then the hub for collaboration with her colleagues. -* When she's happy with the state of her paper, - she posts a version to [arXiv](http://arxiv.org/) - or some other preprint server - to invite feedback from peers. -* Based on that feedback, - she may post several revisions - before finally submitting her paper to a journal. 
-* The published paper includes links to her preprint - and to her code and data repositories, - which makes it much easier for other scientists - to use her work as starting point for their own research. - -This open model accelerates discovery: -the more open work is, -the more widely it is cited and re-used. -However, -people who want to work this way need to make some decisions -about what exactly "open" means in practice. - -### Licensing - -The first question is licensing. -Broadly speaking, -there are two kinds of open license for software, -and half a dozen for data and publications. -For software, -people can choose between the [GNU General Public License](http://opensource.org/licenses/GPL-3.0) (GPL) on the one hand, -and licenses like the [MIT](http://opensource.org/licenses/MIT) -and [BSD](http://opensource.org/licenses/BSD-2-Clause) licenses on the other. -All of these licenses allow unrestricted sharing and modification of programs, -but the GPL is [infective](../../gloss.html#infective-license): -anyone who distributes a modified version of the code -(or anything that includes GPL'd code) -must make *their* code freely available as well. - -Proponents of the GPL argue that this requirement is needed -to ensure that people who are benefiting from freely-available code -are also contributing back to the community. -Opponents counter that many open source projects have had long and successful lives -without this condition, -and that the GPL makes it more difficult to combine code from different sources. -At the end of the day, -what matters most is that: - -1. every project have a file in its home directory - called something like `LICENSE` or `LICENSE.txt` - that clearly states what the license is, and -2. people use existing licenses rather than writing new ones. - -The second point is as important as the first: -most scientists are not lawyers, -so wording that may seem sensible to a layperson -may have unintended gaps or consequences. 
-The [Open Source Initiative](http://opensource.org/) -maintains a list of open source licenses, -and [tl;drLegal](http://www.tldrlegal.com/) explains many of them in plain English. - -When it comes to data, publications, and the like, -scientists have many more options to choose from. -The good news is that an organization called [Creative Commons](http://creativecommons.org/) -has prepared a set of licenses using combinations of four basic restrictions: - -* Attribution: derived works must give the original author credit for their work. -* No Derivatives: people may copy the work, but must pass it along unchanged. -* Share Alike: derivative works must license their work under the same terms as the original. -* Noncommercial: free use is allowed, but commercial use is not. - -These four restrictions are abbreviated "BY", "ND", "SA", and "NC" respectively, -so "CC-BY-ND" means, -"People can re-use the work both for free and commercially, -but cannot make changes and must cite the original." -These [short descriptions](http://creativecommons.org/licenses/) -summarize the six CC licenses in plain language, -and include links to their full legal formulations. - -There is one other important license that doesn't fit into this categorization. -Scientists (and other people) can choose to put material in the public domain, -which is often abbreviated "PD". -In this case, -anyone can do anything they want with it, -without needing to cite the original -or restrict further re-use. -The table below shows how the six Creative Commons licenses and PD relate to one another: - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
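Licenses that can be used for derivative work or adaptation:

| Original work | by | by-nc | by-nc-nd | by-nc-sa | by-nd | by-sa | pd |
|---------------|----|-------|----------|----------|-------|-------|----|
| by            | X  | X     | X        | X        | X     | X     |    |
| by-nc         |    | X     | X        | X        |       |       |    |
| by-nc-nd      |    |       |          |          |       |       |    |
| by-nc-sa      |    |       |          | X        |       |       |    |
| by-nd         |    |       |          |          |       |       |    |
| by-sa         |    |       |          |          |       | X     |    |
| pd            | X  | X     | X        | X        | X     | X     | X  |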
- -[Software Carpentry](http://software-carpentry.org/license.html) -uses CC-BY for its lessons and the MIT License for its code -in order to encourage the widest possible re-use. -Again, -the most important thing is for the `LICENSE` file in the root directory of your project -to state clearly what your license is. -You may also want to include a file called `CITATION` or `CITATION.txt` -that describes how to reference your project; + +Knowing how to effectively publish and distribute open source software +and open data is becoming as important to scientific research as publishing +papers -- indeed, it is already required by many of the most prestigious +journals. In this lesson we focus on the two key components to publishing +data or source code: licensing and repositories. + + +## Open Source Software ## + +### Licensing Software ### + +Open source licenses assist the creator of a creative work in waiving +some of the rights and privileges which they are automatically granted +under [_copyright_ law](http://en.wikipedia.org/wiki/Copyright). + + +Broadly speaking, there are two kinds of open license for +software: [copyleft](http://en.wikipedia.org/wiki/Copyleft) +licenses such as the [GNU General Public +Licenses](http://opensource.org/licenses/GPL-3.0) (GPL), and +[permissive](http://en.wikipedia.org/wiki/Permissive_free_software_licence) +licenses such as the [MIT](http://opensource.org/licenses/MIT) and +[BSD](http://opensource.org/licenses/BSD-2-Clause) licenses. All of these +licenses allow unrestricted sharing and modification of programs, but +copyleft licenses are [infective](../../gloss.html#infective-license): +anyone who distributes a modified version of the code (or anything +that includes GPL'd code) must make *their* code freely available as +well. Code under permissive licenses has no such clause, and as such can +be more [easily re-used in commercial software](http://nipy.sourceforge.net/nipy/devel/faq/johns_bsd_pitch.html). 
+ +#### How to apply a license #### + +Before releasing open source software you should confirm with your +employer that you are the current copyright holder (in academic settings, +faculty tend to control their own copyrights while the copyrights of work +done by staff often belong to the university). + +Software licenses are typically applied by including a plain-text file +with a name such as `LICENSE` or `COPYING` in the project directory. +Some projects will place the full text of the license in comments at +the top of every source file, while others may only declare the choice +of license by an abbreviation and/or a link to the license terms. + +The legal text for most open source licenses is available from the [Open +Source Initiative](http://opensource.org/), which maintains a list of +open source licenses that have gone through its approval process. +[tl;drLegal](http://www.tldrlegal.com/) explains many of them in plain +English. + +When selecting a license, be sure that your choice is consistent with +the terms of any software you may be reusing or modifying (usually by +adopting the license already in use). Note that many licenses have +multiple versions which are not necessarily compatible, so be sure to +state the version explicitly. + + +------------------ + +[Software Carpentry](http://software-carpentry.org/license.html) uses +CC-BY for its lessons and the MIT License for its code in order to +encourage the widest possible re-use. Again, the most important thing +is for the `LICENSE` file in the root directory of your project to state +clearly what your license is. You may also want to include a file called +`CITATION` or `CITATION.txt` that describes how to reference your project; the one for Software Carpentry states:
@@ -201,74 +99,187 @@ Greg Wilson: "Software Carpentry: Lessons Learned". arXiv:1307.5448, July 2013. ~~~
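
As a rough illustration of the licensing and citation advice above, the following sketch checks whether a project directory contains the files the lesson mentions. The function names `find_project_files` and `check_project` are hypothetical, and treating any extension (such as `.txt`) as acceptable is an assumption rather than a convention stated in the lesson.

```python
from pathlib import Path

# File names mentioned in the lesson. Accepting any single extension
# (LICENSE.txt, CITATION.md, ...) is an assumption made for this sketch.
LICENSE_NAMES = {"LICENSE", "COPYING"}
CITATION_NAMES = {"CITATION"}


def find_project_files(project_dir, stems):
    """Return sorted names of top-level files whose stem is in `stems`."""
    root = Path(project_dir)
    return sorted(
        p.name for p in root.iterdir()
        if p.is_file() and p.stem.upper() in stems
    )


def check_project(project_dir):
    """Report which licensing and citation files a project directory contains."""
    return {
        "license": find_project_files(project_dir, LICENSE_NAMES),
        "citation": find_project_files(project_dir, CITATION_NAMES),
    }
```

Calling `check_project(".")` from the root of a repository returns a dictionary with two lists; an empty list under either key suggests that the corresponding file is still missing. The check only looks at file names and says nothing about whether the license text itself is appropriate.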
-### Hosting - -The second big question for groups that want to open up their work -is where to host their code and data. -One option is for the lab, the department, or the university to provide a server, -manage accounts and backups, -and so on. -The main benefit of this is that it clarifies who owns what, -which is particularly important if any of the material is sensitive -(i.e., -relates to experiments involving human subjects -or may be used in a patent application). -The main drawbacks are the cost of providing the service and its longevity: -a scientist who has spent ten years collecting data -would like to be sure that data will still be available ten years from now, -but that's well beyond the lifespan of most of the grants that fund academic infrastructure. - -Another option is to purchase a domain -and pay an Internet service provider (ISP) to host it. -This gives the individual or group more control, -and sidesteps problems that can arise when moving from one institution to another, -but requires more time and effort to set up than either -the option above or the option below. - -The third option is to use a public hosting service like [GitHub](http://github.com), -[BitBucket](http://bitbucket.org), -[Google Code](http://code.google.com), -or [SourceForge](http://sourceforge.net). -All of these allow people to create repositories through a web interface, -and also provide mailing lists, -ways to keep track of who's doing what, -and so on. -They all benefit from economies of scale and network effects: -it's easier to run one large service well -than to run many smaller services to the same standard, -and it's also easier for people to collaborate if they're using the same service, -not least because it gives them fewer passwords to remember. - -However, -all of these services place some constraints on people's work. 
-In particular, -most give users a choice: -if they're willing to share their work with others, -it will be hosted for free, -but if they want privacy, -they may have to pay. -Sharing might seem like the only valid choice for science, -but many institutions may not allow researchers to do this, -either because they want to protect future patent applications -or simply because what's new is often also frightening. +### Hosting & Distributing Software ### + +Open source research software is best distributed through the use of a +dedicated code repository or academic data archive. Most (but not all) +code repositories are built around the use of a version control system +such as `git` or `subversion`, which creates some barrier to entry +(fortunately you've just completed the `git` SWC lessons!). + +Public hosting services such as [GitHub](http://github.com), +[BitBucket](http://bitbucket.org), [Google Code](http://code.google.com), +or [SourceForge](http://sourceforge.net) are feature-rich, +user-friendly, and widely adopted options. All provide free +hosting for open-source projects (and usually a limited number +of free private projects as well). See other recommendations +for code repositories from the [Journal of Open Research +Software](http://openresearchsoftware.metajnl.com/about/editorialPolicies#custom-0). + +Researchers may also choose to distribute software through dedicated +language repositories such as CRAN (R). These language-specific +repositories host only code that is ready for use and will usually make +it easier for other users to install your software. These repositories +also archive versions as they are released, but typically do not +require using version management software. These repositories often +have stricter criteria than the public hosting services described +above, so be sure to consult the appropriate policies (e.g. [CRAN +policies](http://cran.r-project.org/web/packages/policies.html)) before +proceeding.
Many projects host their daily development on public hosting +services while also distributing releases through a system such as this. + + +It has been common practice for researchers to host software they develop +on computer servers managed by their lab, department, or institution. +Experience has shown that software and other resources hosted in this +fashion have a much higher rate of link rot, where changes to websites, +changing jobs, or other factors make it unlikely that these resources are +still available years later. These options also typically lack many of +the features dedicated software repositories provide. Online supplement +sections of journals are also not ideal mechanisms to distribute +software, for many of the same reasons. It is best to simply link from +your publication or personal website to the permanent software repository. + +make it easy to run a Github-like environment on a private server but +may not be as well suited for long-term hosting as the larger dedicated +hosting services. --> + +Some scientific data archives will also host software. Because +these archives are backed by long-term redundant archiving +(e.g. [CLOCKSS](http://clockss.org/)) and permanent identifiers +(e.g. [DOIs](http://en.wikipedia.org/wiki/Digital_object_identifier)), +they offer a more long-term archival storage solution (see Archiving +Data, below). The data repositories [Zenodo](https://zenodo.org) and +[figshare](http://figshare.com) currently have automated [integration +with GitHub](http://collaborate.mozillascience.org/projects/codemeta) +to facilitate this. + + +## Open Data ## + +Learners should know how to publish open data effectively, whether +or not they choose to do so in any particular circumstance. --> + +### Licensing Data ### + +Unlike software or other creative works, data are considered facts and +are generally not subject to copyright.
Many academic data repositories +underscore this by requiring a public-domain declaration such as +[Creative Commons Zero](https://creativecommons.org/about/cc0) (or CC0, +not technically a license) for all data that they host (see the [Panton +Principles](http://pantonprinciples.org) of open data). Even when placing +data or other work in the public domain it is preferable to use a standard +declaration such as CC0, since writing an internationally valid legal +document is a task best left to the relevant experts. + +Data formats, descriptions, or databases are considered creative works +and are frequently accompanied by a copyright statement. Creative Commons +provides a suite of licenses to waive various aspects of copyright in +order to facilitate open reuse. The most permissive of these is the +_Attribution_ or CC-BY license. Alternatives may restrict commercial use +(NC: non-commercial), restrict derivative products (ND: no-derivatives), +or include a copyleft clause (SA: share-alike) similar to the GPL. The six +Creative Commons licenses combine these three clauses with the default BY +clause; ND and SA are never combined, since share-alike applies only to derivatives. + +It is worth noting that only the CC-BY license +is considered compatible with the widely recognized Budapest Open Access +Initiative [definition](http://en.wikipedia.org/wiki/Budapest_Open_Access_Initiative). +Several studies have shown that researchers choose the more restrictive +variations by default and are unaware of the limitations this may +place on uses they condone, such as education. + + +### Archiving & Distributing Data ### + +Many journals now require authors to deposit all data supporting published +results into a scientific data repository. As with software repositories, +data repositories are better suited for sharing data than hosting on one's own +website or in a journal's supplemental online materials. + +Scientific data repositories may be divided into two types: those +accepting only published data accompanying a scientific article +(e.g.
[Dryad](http://datadryad.org)), and those that also accept data +that is not (or not yet) associated with any particular publication +([Zenodo](http://zenodo.org), [figshare](http://figshare.com)). Some +repositories focus on narrow subject areas or data types, while others +are more general purpose. Consult the policies of your journals, +discipline-specific literature on data archiving, and the policies of the +data archives themselves to find a good match. The [recommendations +from _Nature_](http://www.nature.com/sdata/data-policies/repositories) +are one good place to start. + + +Data repositories provide many advantages, including: + + +- **Permanent identifiers:** Though widely touted as making your data +'citable', permanent identifiers are designed to avoid link rot that +results from changing URLs (hence the name). The [Digital Object +Identifier](http://en.wikipedia.org/wiki/Digital_object_identifier), +or DOI, is the best known because of its association with scientific +publications.[^1] An object with a DOI number can be found by entering +the number into a central registry, [http://doi.org](http://doi.org), +regardless of the URL address currently hosting it. Repositories must pay +a small fee for each DOI. If a repository fails to update the records +allowing the DOI to resolve to the correct resource, the DOI provider +may refuse to sell them additional DOIs. + +- **Metadata & data discovery:** Data repositories collect basic metadata +such as author and subject information. This facilitates search and +discovery of relevant datasets. DOI-based repositories submit much +of this information in a standardized format to the central registry +at DataCite, which allows tools and researchers to search across all +DataCite repositories at once. + +- **Data management:** Data repositories are well equipped to +provide redundant and reliable access to data over the long +term.
Data can be updated or corrected while maintaining links +to the original versions. Archiving in a repository also looks good in [data management +plans](http://www.nsf.gov/eng/general/dmp.jsp). + +[^1]: Technically, data DOIs differ from scientific publication +DOIs in that the former are administered by DataCite and the latter by +CrossRef, and as such include slightly different metadata and protocols. + +#### Special cases #### + +Data security concerns are not a good reason to be lazy about data archiving. +Sensitive data (e.g. data on human experimental subjects) should always be +dealt with as such, following appropriate anonymization and/or security +protocols defined before the data is collected. Many repositories have +explicit mechanisms in place to handle sensitive data appropriately. +Storing sensitive data on personal machines without clear security policies in +place may be inappropriate. + +Rapidly updated, streaming, or very large datasets (usually >2-10 +GB) still pose challenges for most general purpose scientific data +repositories.
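
The DOI lookup described under "Permanent identifiers" above can be sketched in a few lines: the resolver address from the lesson is combined with the DOI to form a link that stays valid even if the hosting URL changes. This is a minimal sketch; the function name `doi_to_url` and the example DOI are hypothetical, and the validity check is deliberately loose (it only relies on DOIs starting with `10.` and containing a `/`).

```python
from urllib.parse import quote

# Resolver address given in the lesson.
DOI_RESOLVER = "http://doi.org/"


def doi_to_url(doi):
    """Build a resolver URL from a bare or 'doi:'-prefixed DOI string."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Loose sanity check only: a prefix beginning with "10." and a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")


# Hypothetical DOI, used for illustration only.
print(doi_to_url("doi:10.1234/example.5678"))
```

Linking to a dataset through the resolver rather than through the repository's current page address is what protects the link from rot; the sketch does not contact the registry, so it cannot tell whether a DOI is actually registered.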
-#### Key Points -* Open scientific work is more useful and more highly cited than closed. -* People who incorporate GPL'd software into theirs must make theirs open; - most other open licenses do not require this. -* The Creative Commons family of licenses allow people to mix and match - requirements and restrictions on attribution, - creation of derivative works, - further sharing, - and commercialization. -* People who are not lawyers should not try to write licenses from scratch. -* Projects can be hosted on university servers, - on personal domains, - or on public forges. -* Rules regarding intellectual property and storage of sensitive information apply - no matter where code and data are hosted. +## Key Points ## + +* Open source licenses include both permissive (BSD, MIT) and copyleft +(GPL) style licenses. Anyone distributing software with code taken from +or modified from code under a GPL-style license must make their derivative +source code available under the same terms. + +* Open data should be placed in the public domain using the CC0 +declaration (copyright not being applicable to facts). + +* Dedicated software repositories such as [GitHub](http://github.com), +[BitBucket](http://bitbucket.org), [Google Code](http://code.google.com), +or [SourceForge](http://sourceforge.net) are preferable to self-hosting software. + +* Other creative works, including data descriptions and publications, can +use Creative Commons licenses to facilitate reuse. The most permissive +license, CC-BY, corresponds with community definitions of Open Access, +while others are more restrictive. + +* Dedicated scientific data repositories, such as those integrated with +DataCite (e.g. any that provide DOIs), are the preferable mechanism for +data archiving.