diff --git a/novice/git/04-open.md b/novice/git/04-open.md
index abdaa2ca7..679f414dc 100644
--- a/novice/git/04-open.md
+++ b/novice/git/04-open.md
@@ -6,182 +6,80 @@ title: Open Science
#### Objectives
-* Explain how the GNU General Public License (GPL) differs from most other open licenses.
-* Explain the four kinds of restrictions that can be combined in a Creative Commons license.
-* Correctly add licensing and citation information to a project repository.
-* Outline options for hosting code and data and the pros and cons of each.
+* Learn how to distribute open source software:
+ - Choosing an appropriate open source license
+ - Choosing an appropriate hosting repository
+* Learn how to distribute open data:
+ - Understanding licensing concerns for data and metadata
+ - Choosing an appropriate data repository
-
-The opposite of "open" isn't "closed".
-The opposite of "open" is "broken".
-
-— John Wilbanks
-
-
-Free sharing of information might be the ideal in science,
-but the reality is often more complicated.
-Normal practice today looks something like this:
-
-* A scientist collects some data and stores it on a machine
- that is occasionally backed up by her department.
-* She then writes or modifies a few small programs
- (which also reside on her machine)
- to analyze that data.
-* Once she has some results,
- she writes them up and submits her paper.
- She might include her data---a growing number of journals require this---but
- she probably doesn't include her code.
-* Time passes.
-* The journal sends her reviews written anonymously by a handful of other people in her field.
- She revises her paper to satisfy them,
- during which time she might also modify the scripts she wrote earlier,
- and resubmits.
-* More time passes.
-* The paper is eventually published.
- It might include a link to an online copy of her data,
- but the paper itself will be behind a paywall:
- only people who have personal or institutional access
- will be able to read it.
-
-For a growing number of scientists,
-though,
-the process looks like this:
-
-* The data that the scientist collects is stored in an open access repository
- like [figshare](http://figshare.com/) or [Dryad](http://datadryad.org/)
- as soon as it's collected,
- and given its own DOI.
-* The scientist creates a new repository on GitHub to hold her work.
-* As she does her analysis,
- she pushes changes to her scripts
- (and possibly some output files)
- to that repository.
- She also uses the repository for her paper;
- that repository is then the hub for collaboration with her colleagues.
-* When she's happy with the state of her paper,
- she posts a version to [arXiv](http://arxiv.org/)
- or some other preprint server
- to invite feedback from peers.
-* Based on that feedback,
- she may post several revisions
- before finally submitting her paper to a journal.
-* The published paper includes links to her preprint
- and to her code and data repositories,
- which makes it much easier for other scientists
- to use her work as starting point for their own research.
-
-This open model accelerates discovery:
-the more open work is,
-the more widely it is cited and re-used.
-However,
-people who want to work this way need to make some decisions
-about what exactly "open" means in practice.
-
-### Licensing
-
-The first question is licensing.
-Broadly speaking,
-there are two kinds of open license for software,
-and half a dozen for data and publications.
-For software,
-people can choose between the [GNU General Public License](http://opensource.org/licenses/GPL-3.0) (GPL) on the one hand,
-and licenses like the [MIT](http://opensource.org/licenses/MIT)
-and [BSD](http://opensource.org/licenses/BSD-2-Clause) licenses on the other.
-All of these licenses allow unrestricted sharing and modification of programs,
-but the GPL is [infective](../../gloss.html#infective-license):
-anyone who distributes a modified version of the code
-(or anything that includes GPL'd code)
-must make *their* code freely available as well.
-
-Proponents of the GPL argue that this requirement is needed
-to ensure that people who are benefiting from freely-available code
-are also contributing back to the community.
-Opponents counter that many open source projects have had long and successful lives
-without this condition,
-and that the GPL makes it more difficult to combine code from different sources.
-At the end of the day,
-what matters most is that:
-
-1. every project have a file in its home directory
- called something like `LICENSE` or `LICENSE.txt`
- that clearly states what the license is, and
-2. people use existing licenses rather than writing new ones.
-
-The second point is as important as the first:
-most scientists are not lawyers,
-so wording that may seem sensible to a layperson
-may have unintended gaps or consequences.
-The [Open Source Initiative](http://opensource.org/)
-maintains a list of open source licenses,
-and [tl;drLegal](http://www.tldrlegal.com/) explains many of them in plain English.
-
-When it comes to data, publications, and the like,
-scientists have many more options to choose from.
-The good news is that an organization called [Creative Commons](http://creativecommons.org/)
-has prepared a set of licenses using combinations of four basic restrictions:
-
-* Attribution: derived works must give the original author credit for their work.
-* No Derivatives: people may copy the work, but must pass it along unchanged.
-* Share Alike: derivative works must license their work under the same terms as the original.
-* Noncommercial: free use is allowed, but commercial use is not.
-
-These four restrictions are abbreviated "BY", "ND", "SA", and "NC" respectively,
-so "CC-BY-ND" means,
-"People can re-use the work both for free and commercially,
-but cannot make changes and must cite the original."
-These [short descriptions](http://creativecommons.org/licenses/)
-summarize the six CC licenses in plain language,
-and include links to their full legal formulations.
-
-There is one other important license that doesn't fit into this categorization.
-Scientists (and other people) can choose to put material in the public domain,
-which is often abbreviated "PD".
-In this case,
-anyone can do anything they want with it,
-without needing to cite the original
-or restrict further re-use.
-The table below shows how the six Creative Commons licenses and PD relate to one another:
-
-
-
- Licenses that can be used for derivative work or adaptation
-
- | Original work | by | by-nc | by-nc-nd | by-nc-sa | by-nd | by-sa | pd |
-
-
- | by | X | X | X | X | X | X | |
-
-
- | by-nc | | X | X | X | | | |
-
-
- | by-nc-nd | | | | | | | |
-
-
- | by-nc-sa | | | | X | | | |
-
-
- | by-nd | | | | | | | |
-
-
- | by-sa | | | | | | X | |
-
-
- | pd | X | X | X | X | X | X | X |
-
-
-
-[Software Carpentry](http://software-carpentry.org/license.html)
-uses CC-BY for its lessons and the MIT License for its code
-in order to encourage the widest possible re-use.
-Again,
-the most important thing is for the `LICENSE` file in the root directory of your project
-to state clearly what your license is.
-You may also want to include a file called `CITATION` or `CITATION.txt`
-that describes how to reference your project;
+
+Knowing how to effectively publish and distribute open source software
+and open data is becoming as important to scientific research as publishing
+papers -- indeed, it is already required by many of the most prestigious
+journals. In this lesson we focus on the two key components of publishing
+data or source code: licensing and repositories.
+
+
+## Open Source Software ##
+
+### Licensing Software ###
+
+Open source licenses assist the creator of a creative work in waiving
+some of the rights and privileges which they are automatically granted
+under [_copyright_ law](http://en.wikipedia.org/wiki/Copyright).
+
+
+Broadly speaking, there are two kinds of open license for
+software: [copyleft](http://en.wikipedia.org/wiki/Copyleft)
+licenses such as the [GNU General Public
+License](http://opensource.org/licenses/GPL-3.0) (GPL), and
+[permissive](http://en.wikipedia.org/wiki/Permissive_free_software_licence)
+licenses such as the [MIT](http://opensource.org/licenses/MIT) and
+[BSD](http://opensource.org/licenses/BSD-2-Clause) licenses. All of these
+licenses allow unrestricted sharing and modification of programs, but
+copyleft licenses are [infective](../../gloss.html#infective-license):
+anyone who distributes a modified version of the code (or anything
+that includes GPL'd code) must make *their* code freely available as
+well. Code under permissive licenses has no such clause, and as such can
+be more [easily re-used in commercial software](http://nipy.sourceforge.net/nipy/devel/faq/johns_bsd_pitch.html).
+
+#### How to apply a license ####
+
+Before releasing open source software you should confirm with your
+employer that you are the current copyright holder (in academic settings,
+faculty tend to control their own copyrights while the copyrights of work
+done by staff often belong to the university).
+
+Software licenses are typically applied by including a plain-text file
+with a name such as `LICENSE` or `COPYING` in the project directory.
+Some projects will place the full text of the license in comments at
+the top of every source file, while others may only declare the choice
+of license by an abbreviation and/or a link to the license terms.
+
+The legal text for most open source licenses can be found from the [Open
+Source Initiative](http://opensource.org/), which maintains a list of
+open source licenses which have gone through their approval process.
+[tl;drLegal](http://www.tldrlegal.com/) explains many of them in plain
+English.
+
+When selecting a license, be sure that your choice is consistent with
+the terms of any software you may be reusing or modifying (usually by
+adopting the license already in use). Note that many licenses have
+multiple versions which are not necessarily compatible, so be sure to
+be explicit.
+
+
+------------------
+
+[Software Carpentry](http://software-carpentry.org/license.html) uses
+CC-BY for its lessons and the MIT License for its code in order to
+encourage the widest possible re-use. The most important thing
+is for the `LICENSE` file in the root directory of your project to state
+clearly what your license is. You may also want to include a file called
+`CITATION` or `CITATION.txt` that describes how to reference your project;
the one for Software Carpentry states:
@@ -201,74 +99,187 @@ Greg Wilson: "Software Carpentry: Lessons Learned". arXiv:1307.5448, July 2013.
~~~
-### Hosting
-
-The second big question for groups that want to open up their work
-is where to host their code and data.
-One option is for the lab, the department, or the university to provide a server,
-manage accounts and backups,
-and so on.
-The main benefit of this is that it clarifies who owns what,
-which is particularly important if any of the material is sensitive
-(i.e.,
-relates to experiments involving human subjects
-or may be used in a patent application).
-The main drawbacks are the cost of providing the service and its longevity:
-a scientist who has spent ten years collecting data
-would like to be sure that data will still be available ten years from now,
-but that's well beyond the lifespan of most of the grants that fund academic infrastructure.
-
-Another option is to purchase a domain
-and pay an Internet service provider (ISP) to host it.
-This gives the individual or group more control,
-and sidesteps problems that can arise when moving from one institution to another,
-but requires more time and effort to set up than either
-the option above or the option below.
-
-The third option is to use a public hosting service like [GitHub](http://github.com),
-[BitBucket](http://bitbucket.org),
-[Google Code](http://code.google.com),
-or [SourceForge](http://sourceforge.net).
-All of these allow people to create repositories through a web interface,
-and also provide mailing lists,
-ways to keep track of who's doing what,
-and so on.
-They all benefit from economies of scale and network effects:
-it's easier to run one large service well
-than to run many smaller services to the same standard,
-and it's also easier for people to collaborate if they're using the same service,
-not least because it gives them fewer passwords to remember.
-
-However,
-all of these services place some constraints on people's work.
-In particular,
-most give users a choice:
-if they're willing to share their work with others,
-it will be hosted for free,
-but if they want privacy,
-they may have to pay.
-Sharing might seem like the only valid choice for science,
-but many institutions may not allow researchers to do this,
-either because they want to protect future patent applications
-or simply because what's new is often also frightening.
+### Hosting & Distributing Software ###
+
+Open source research software is best distributed through the use of a
+dedicated code repository or academic data archive. Most (but not all)
+code repositories are built around the use of a version control system
+such as `git` or `subversion`, which creates some barrier to entry
+(fortunately, you've just completed the `git` SWC lessons!).
+
+Public hosting services such as [GitHub](http://github.com),
+[BitBucket](http://bitbucket.org), [Google Code](http://code.google.com),
+or [SourceForge](http://sourceforge.net) are feature-rich,
+user-friendly, and widely adopted options. All provide free
+hosting for open-source projects (and usually a limited number
+of free private projects as well). See other recommendations
+for code repositories from the [Journal of Open Research
+Software](http://openresearchsoftware.metajnl.com/about/editorialPolicies#custom-0).
+
+Researchers may also choose to distribute software through dedicated
+language repositories such as CRAN (R). These language-specific
+repositories host only code that is ready for use and will usually make
+it easier for other users to install your software. These repositories
+also archive versions as they are released, but typically do not
+require using version management software. These repositories often
+have stricter criteria than the public hosting services described
+above, so be sure to consult the appropriate policies (e.g. [CRAN
+policies](http://cran.r-project.org/web/packages/policies.html)) before
+proceeding. Many projects host their daily development on public hosting
+services while also distributing releases through a system such as this.
+
+
+It has been common practice for researchers to host software they develop
+on computer servers managed by their lab, department, or institution.
+Experience has shown that software and other resources hosted in this
+fashion have a much higher rate of link rot, where changes to websites,
+changing jobs, or other factors make it unlikely that these resources are
+still available years later. These options also typically lack many of
+the features dedicated software repositories provide. Online supplement
+sections of journals are also not ideal mechanisms to distribute
+software, for many of the same reasons. It is best to simply link from
+your publication or personal website to the permanent software repository.
+
+<!-- make it easy to run a GitHub-like environment on a private server but
+may not be as well suited for long-term hosting as the larger dedicated
+hosting services. -->
+
+Some scientific data archives will also host software. Because
+these archives are backed by long-term redundant archiving
+(e.g. [CLOCKSS](http://clockss.org/)) and permanent identifiers
+(e.g. [DOIs](http://en.wikipedia.org/wiki/Digital_object_identifier)),
+they offer a more long-term archival storage solution (see Archiving
+Data, below). The data repositories [Zenodo](https://zenodo.org) and
+[figshare](http://figshare.com) currently have automated [integration
+with GitHub](http://collaborate.mozillascience.org/projects/codemeta)
+to facilitate this.
+
+
+## Open Data ##
+
+<!-- Learners should know how to publish open data effectively, whether
+or not they choose to do so in any particular circumstance. -->
+
+### Licensing Data ###
+
+Unlike software or other creative works, data are considered facts and
+generally not subject to copyright. Many academic data repositories
+underscore this by requiring a public-domain declaration such as
+[Creative Commons Zero](https://creativecommons.org/about/cc0) (or CC0,
+not technically a license) for all data that they host (see the [Panton
+Principles](http://pantonprinciples.org) of open data). Even when placing
+data or other work in the public domain it is preferable to use a standard
+declaration such as CC0, since writing an internationally valid legal
+document is a task best left to the relevant experts.
+
+Data formats, descriptions, or databases are considered creative works
+and are frequently accompanied by a copyright statement. Creative Commons
+provides a suite of licenses to waive various aspects of copyright in
+order to facilitate open reuse. The most permissive of these is the
+_Attribution_ or CC-BY license. Alternatives may restrict commercial use
+(NC: non-commercial), restrict derivative products (ND: no-derivatives),
+or include a copyleft clause (SA: share-alike) similar to the GPL. Different
+licenses combine the latter three clauses on top of the default BY clause
+(ND and SA are mutually exclusive, which yields six licenses in total).
+
+It is worth noting that only the CC-BY license
+is considered compatible with the widely recognized Budapest Open Access
+Initiative [definition](http://en.wikipedia.org/wiki/Budapest_Open_Access_Initiative).
+Several studies have shown that researchers choose the more restrictive
+variations by default and are unaware of the limitations this may
+place on uses they condone, such as education.
+
+
+### Archiving & Distributing Data ###
+
+Many journals now require authors to deposit all data supporting published
+results into a scientific data repository. As with software repositories,
+data repositories are better suited for sharing data than hosting on one's own
+website or in a journal's supplemental online materials.
+
+Scientific data repositories may be divided into two types: those
+accepting only published data accompanying a scientific article
+(e.g. [Dryad](http://datadryad.org)), and those that also accept data
+that is not (or not yet) associated with any particular publication
+([Zenodo](http://zenodo.org), [figshare](http://figshare.com)). Some
+repositories focus on narrow subject areas or data types, while others
+are more general purpose. Consult the policies of your journals,
+discipline-specific literature on data archiving, and the policies of the
+data archives themselves in finding a good match. The [recommendations
+from _Nature_](http://www.nature.com/sdata/data-policies/repositories)
+are one good place to start.
+
+
+Data repositories provide many advantages, including:
+
+
+- **Permanent identifiers:** Though widely touted as making your data
+'citable', permanent identifiers are designed to avoid link rot that
+results from changing URLs (hence the name). The [Digital Object
+Identifier](http://en.wikipedia.org/wiki/Digital_object_identifier),
+or DOI, is the best known because of its association with scientific
+publications.[^1] An object with a DOI can be found by entering
+the identifier into a central registry, [http://doi.org](http://doi.org),
+regardless of the URL currently hosting it. Repositories must pay
+a small fee for each DOI. If a repository fails to update the records
+allowing the DOI to resolve to the correct resource, the DOI provider
+may refuse to sell them additional DOIs.
+
+- **Metadata & data discovery:** Data repositories collect basic metadata
+such as author and subject information. This facilitates search and
+discovery of relevant datasets. DOI-based repositories submit much
+of this information in a standardized format to the central registry
+at DataCite, which allows tools and researchers to search across all
+DataCite repositories at once.
+
+- **Data management:** Data repositories are well equipped to
+provide redundant and reliable access to data over the long
+term. Data can be updated or corrected while maintaining links
+to the original versions. Depositing data in a repository also strengthens
+[data management plans](http://www.nsf.gov/eng/general/dmp.jsp).
+
+[^1]: Technically, data DOIs differ from scientific publication
+DOIs in that the former are administered by DataCite and the latter by
+CrossRef, and as such include slightly different metadata and protocols.
+
+#### Special cases ####
+
+Data security concerns are not a good reason to be lazy about data archiving.
+Sensitive data (e.g. human experimental subjects) should always be
+dealt with as such, following appropriate anonymization and/or security
+protocols defined before the data is collected. Many repositories have
+explicit mechanisms in place to handle sensitive data appropriately.
+Storing sensitive data on personal machines without clear security policies in
+place may be inappropriate.
+
+Rapidly updated, streaming, or very large datasets (usually >2-10
+GB) still pose challenges for most general purpose scientific data
+repositories.
-#### Key Points
-* Open scientific work is more useful and more highly cited than closed.
-* People who incorporate GPL'd software into theirs must make theirs open;
- most other open licenses do not require this.
-* The Creative Commons family of licenses allow people to mix and match
- requirements and restrictions on attribution,
- creation of derivative works,
- further sharing,
- and commercialization.
-* People who are not lawyers should not try to write licenses from scratch.
-* Projects can be hosted on university servers,
- on personal domains,
- or on public forges.
-* Rules regarding intellectual property and storage of sensitive information apply
- no matter where code and data are hosted.
+## Key Points ##
+
+* Open source licenses include both permissive (BSD, MIT) and copyleft
+(GPL) style licenses. Anyone distributing software with code taken from
+or modified from code under a GPL-style license must make their derivative
+source code available under the same terms.
+
+* Open data should be placed in the public domain using the CC0
+declaration (copyright not being applicable to facts).
+
+* Dedicated software repositories such as [GitHub](http://github.com),
+[BitBucket](http://bitbucket.org), [Google Code](http://code.google.com),
+or [SourceForge](http://sourceforge.net) are preferable to self-hosting software.
+
+* Other creative works, including data descriptions and publications, can
+use Creative Commons licenses to facilitate reuse. The most permissive
+license, CC-BY, corresponds with community definitions of Open Access,
+while others are more restrictive.
+
+* Dedicated scientific data repositories, such as those integrated with
+DataCite (e.g. any that provide DOIs), are the preferable mechanism for
+data archiving.