Cleanup entry "Move DOIs from note and URL field to DOI field and remove http prefix" incorrectly recognizes URLs ending with "2010/stuff" as DOIs #6880

@JasonGross

JabRef version 5.2--2020-09-06--c0b139a on Windows 10 10.0 amd64, Java 14.0.2

Steps to reproduce the behavior:

  1. Save the file
@Misc{TrustedSlind,
  author   = {Konrad Slind},
  title    = {Trusted Extensions of Interactive Theorem Provers: Workshop Summary},
  date     = {2010-08},
  location = {Cambridge, England},
  url      = {http://www.cs.utexas.edu/users/kaufmann/itp-trusted-extensions-aug-2010/summary/summary.pdf},
}

as a .bib file.

  2. Open this file in JabRef
  3. Click on the one entry to select it
  4. Click Quality -> Cleanup entries / Alt+F8
  5. Ensure that only the first item ("Move DOIs from note and URL field to DOI field and remove http prefix") is checked
  6. Click OK
  7. Double-click on the entry and click "BibTeX source"

Note that the new source is

@Misc{TrustedSlind,
  author   = {Konrad Slind},
  title    = {Trusted Extensions of Interactive Theorem Provers: Workshop Summary},
  date     = {2010-08},
  doi      = {10/summary},
  location = {Cambridge, England},
}

This URL is not a DOI link, though! Presumably this is because the matcher code at

// Regex
// (see http://www.doi.org/doi_handbook/2_Numbering.html)
private static final String DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "(?:\\.[0-9]+)+" // registrant codes
+ "[/:%]" // divider
+ "(?:.+)" // suffix alphanumeric string
+ ")"; // end group \1
private static final String FIND_DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "(?:\\.[0-9]+)+" // registrant codes
+ "[/:]" // divider
+ "(?:[^\\s]+)" // suffix alphanumeric without space
+ ")"; // end group \1
// Regex (Short DOI)
private static final String SHORT_DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "[/:%]" // divider
+ "[a-zA-Z0-9]+"
+ ")"; // end group \1
private static final String FIND_SHORT_DOI_EXP = ""
+ "(?:urn:)?" // optional urn
+ "(?:doi:)?" // optional doi
+ "(" // begin group \1
+ "10" // directory indicator
+ "[/:]" // divider
+ "[a-zA-Z0-9]+"
+ "(?:[^\\s]+)" // suffix alphanumeric without space
+ ")"; // end group \1
private static final String HTTP_EXP = "https?://[^\\s]+?" + DOI_EXP;
private static final String SHORT_DOI_HTTP_EXP = "https?://[^\\s]+?" + SHORT_DOI_EXP;
// Pattern
private static final Pattern EXACT_DOI_PATT = Pattern.compile("^(?:https?://[^\\s]+?)?" + DOI_EXP + "$", Pattern.CASE_INSENSITIVE);
private static final Pattern DOI_PATT = Pattern.compile("(?:https?://[^\\s]+?)?" + FIND_DOI_EXP, Pattern.CASE_INSENSITIVE);
// Pattern (short DOI)
private static final Pattern EXACT_SHORT_DOI_PATT = Pattern.compile("^(?:https?://[^\\s]+?)?" + SHORT_DOI_EXP, Pattern.CASE_INSENSITIVE);
private static final Pattern SHORT_DOI_PATT = Pattern.compile("(?:https?://[^\\s]+?)?" + FIND_SHORT_DOI_EXP, Pattern.CASE_INSENSITIVE);

considers any non-space text that starts with http:// or https:// and contains 10/ followed by further non-space text to be a DOI. This is absurd. The character immediately preceding the 10, doi:, or urn: should at the very least be required to be a URL separator character such as /, :, ?, &, or =.
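
To make the false positive concrete, here is a minimal sketch, not JabRef's actual code. The class name `ShortDoiSketch`, the constants `LENIENT` and `STRICT`, and the helper `extract` are all hypothetical. `LENIENT` inlines the same pieces as `EXACT_SHORT_DOI_PATT` above; `STRICT` is one possible reading of the separator requirement proposed here, and it assumes that a short DOI may also appear with no URL prefix at all.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShortDoiSketch {
    // URL from the reproduction steps above.
    static final String SAMPLE_URL =
            "http://www.cs.utexas.edu/users/kaufmann/itp-trusted-extensions-aug-2010/summary/summary.pdf";

    // Same shape as EXACT_SHORT_DOI_PATT: a lazy URL prefix may end anywhere,
    // so the "10/" inside "2010/" is accepted as the start of a short DOI.
    static final Pattern LENIENT = Pattern.compile(
            "^(?:https?://[^\\s]+?)?(?:urn:)?(?:doi:)?(10[/:%][a-zA-Z0-9]+)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical stricter variant (an assumption, not the upstream fix):
    // if a URL prefix is present, it must end in one of / : ? & =
    // before the optional "urn:" / "doi:" and the "10".
    static final Pattern STRICT = Pattern.compile(
            "^(?:https?://[^\\s]*?[/:?&=])?(?:urn:)?(?:doi:)?(10[/:%][a-zA-Z0-9]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns capture group 1 (the candidate short DOI) if the pattern is found.
    static Optional<String> extract(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // The lenient pattern reproduces the bogus DOI from the report.
        System.out.println(extract(LENIENT, SAMPLE_URL).orElse("<none>"));
        // The stricter pattern rejects the same URL.
        System.out.println(extract(STRICT, SAMPLE_URL).orElse("<none>"));
        // A URL where "10" directly follows a separator is still recognized.
        System.out.println(extract(STRICT, "https://doi.org/10/abcde").orElse("<none>"));
    }
}
```

This sketch only adds the separator check. It still accepts a URL such as https://example.org/10/summary/more, because the short DOI pattern has no end-of-input anchor, so a complete fix would likely need more than this one change.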
