Priority fetcher #3882

stefan-kolb · 2018-03-23T15:15:52Z

All fetchers now run in parallel.
The result with the highest priority wins.
Ignoring InterruptedException | ExecutionException | CancellationException is the more critical part; I am not sure the code behaves as we want every time.

Trust Levels:

SOURCE (highest)
PUBLISHER
PREPRINT
META_SEARCH
UNKNOWN

Current trust levels:

DOI: SOURCE
ScienceDirect: PUBLISHER
Springer: PUBLISHER
ACS: PUBLISHER
IEEE: PUBLISHER
Google Scholar: META_SEARCH
Arxiv: PREPRINT
OpenAccessDOI: META_SEARCH
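
To make the "highest priority wins" rule concrete, here is a minimal sketch. The enum constants mirror the list above, but the numeric scores, the `Candidate` holder, and the `best` helper are assumptions made for illustration and are not taken from the PR's code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Stand-in for the PR's TrustLevel; the numeric scores are an assumption.
enum TrustLevel {
    SOURCE(4), PUBLISHER(3), PREPRINT(2), META_SEARCH(1), UNKNOWN(0);

    private final int score;

    TrustLevel(int score) {
        this.score = score;
    }

    int getScore() {
        return score;
    }
}

// Hypothetical holder for one fetcher's answer.
final class Candidate {
    final String fetcherName;
    final TrustLevel trust;
    final String url;

    Candidate(String fetcherName, TrustLevel trust, String url) {
        this.fetcherName = fetcherName;
        this.trust = trust;
        this.url = url;
    }
}

final class TrustLevelSketch {
    // Returns the candidate with the highest trust score, or empty if there are none.
    static Optional<Candidate> best(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingInt((Candidate c) -> c.trust.getScore()));
    }
}
```

With candidates from Google Scholar (META_SEARCH), DOI (SOURCE), and Arxiv (PREPRINT), `best` would pick the DOI result.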

…3879

Siedlerchr · 2018-03-28T10:36:19Z

src/main/java/org/jabref/gui/externalfiles/DownloadExternalFile.java

-        tmp.deleteOnExit();
-
-        URLDownload udl = new URLDownload(url);
+        final File tempFile = File.createTempFile("jabref_download", "tmp");


Use Files.createTempFile
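
A minimal sketch of what this suggestion could look like with `java.nio.file.Files`. The class and method names are hypothetical. The prefix is taken from the diff; the ".tmp" suffix (with a leading dot) is an assumption, since the suffix is appended verbatim and "tmp" without the dot would produce names like `jabref_download123tmp`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSketch {
    // Creates an empty temporary file in the default temp directory.
    // Wrapping IOException is a choice made here to keep the call site short.
    static Path createDownloadTarget() {
        try {
            return Files.createTempFile("jabref_download", ".tmp");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned `Path` can be removed explicitly with `Files.deleteIfExists` once the download has been processed.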

Siedlerchr · 2018-03-28T10:38:48Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

 public class FulltextFetchers {
    private static final Logger LOGGER = LoggerFactory.getLogger(FulltextFetchers.class);

+    private static final int FETCHER_TIMEOUT = 10;


Maybe add the unit of time we wait, e.g. 10 seconds, minutes, or hours?
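
Two common ways to make the unit visible, as a hedged sketch. The constant names are hypothetical, and "seconds" is an assumption here, since the diff does not state the unit.

```java
import java.time.Duration;

final class FetcherTimeoutSketch {
    // Option 1: put the unit into the constant's name (assumed to be seconds).
    static final int FETCHER_TIMEOUT_SECONDS = 10;

    // Option 2: use a Duration, which carries its unit with it.
    static final Duration FETCHER_TIMEOUT = Duration.ofSeconds(FETCHER_TIMEOUT_SECONDS);

    // Converts the timeout for APIs that expect milliseconds.
    static long timeoutMillis() {
        return FETCHER_TIMEOUT.toMillis();
    }
}
```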

Siedlerchr · 2018-03-28T10:40:54Z

src/main/java/org/jabref/logic/importer/fetcher/GoogleScholar.java

+            if (link.first() != null) {
+                String target = link.first().attr("href");
+                // link present?
+                if (!"".equals(target) && new URLDownload(target).isPdf()) {


Siedlerchr

some minor things, but overall looks good

lenhard

Nice PR! I have tested it locally with a number of entries and it always worked. Very handy. I also think the scheme for prioritization is fine.

I have a few suggestions in the code that you can think about, but they are not blockers.

When executing this in IntelliJ, I get console messages of the form:

SORT: 0 https://someurl.com/somepdf.pdf

I didn't find where these come from (also didn't search too much). Anyway, could you get rid of those?

lenhard · 2018-03-28T11:50:12Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

            }
+
+            return result.stream()
+                    .map(FulltextFetchers::waitForResult)


Do you really need this mapping? All futures that have not completed when the time runs out in the call to invokeAll above are cancelled, see https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#invokeAll(java.util.Collection,%20long,%20java.util.concurrent.TimeUnit)

So there is probably no point in waiting.

That's right, good point 👍
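
The behavior described above can be sketched as follows: after the timed `invokeAll` returns, every future is either done or cancelled, so `get()` no longer blocks. The class and method names are hypothetical, and the per-call thread pool is only there to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class InvokeAllSketch {
    // Runs all tasks in parallel and returns the results that arrived within the timeout.
    static List<String> completedWithin(List<Callable<String>> tasks, long timeoutMillis) {
        ExecutorService executor = Executors.newCachedThreadPool();
        List<String> results = new ArrayList<>();
        try {
            List<Future<String>> futures =
                    executor.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<String> future : futures) {
                if (future.isCancelled()) {
                    continue; // did not finish in time; invokeAll cancelled it
                }
                try {
                    results.add(future.get()); // returns immediately, the task is done
                } catch (ExecutionException e) {
                    // the task itself failed; skipped in this sketch
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return results;
    }
}
```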

lenhard · 2018-03-28T11:55:43Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

+            findDoiForEntry(clonedEntry);
+        }
+
+        ExecutorService executor = Executors.newCachedThreadPool();


Just an idea: Building a thread pool consumes resources and you are doing this here for every method call. Maybe it would make sense to reuse the JabRefExecutorService.executeAll() in this method? JabRef already takes care of shutting down this thread pool, so you could avoid all the lifecycle overhead.

I used the JabRefExecutorService now. But I think the shutdown there is not correct and might never finish if the tasks themselves do not finish. Currently there is only a shutdown call and no shutdownNow call.

The only difference between shutdown and shutdownNow is that in the first case running and queued tasks are still allowed to complete, and in the second running tasks are interrupted and queued ones are dropped. A task that ignores interruption and never finishes will not terminate in either case.

lenhard · 2018-03-28T11:59:32Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java


-                if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) {
-                    return result;
+    private void shutdownAndAwaitTermination(ExecutorService pool) {


Not exactly the "usual" way to shut down a thread pool, but a fair compromise for keeping the application alive. Please note that this does not guarantee a shutdown and might result in memory leaks in weird and highly unlikely circumstances. I would prefer outsourcing this to the JabRefExecutorService (see comment above).

Why isn't this the usual way? It is directly from the Java docs 😄
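
For reference, a sketch modeled on the two-phase shutdown idiom from the `ExecutorService` Javadoc. It is not the PR's (later removed) method; the class name, the parameterized timeout, and the boolean return value are choices made here for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class ShutdownSketch {
    // Returns true if the pool terminated, false if tasks were still running at the end.
    static boolean shutdownAndAwaitTermination(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // stop accepting new tasks; submitted tasks keep running
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow(); // interrupt running tasks, drop queued ones
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

As noted in the discussion, a task that ignores interruption keeps its thread alive even after `shutdownNow`, in which case this method returns false.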

lenhard · 2018-03-28T12:04:31Z

src/main/java/org/jabref/logic/importer/fetcher/GoogleScholar.java

+                if (!"".equals(target) && new URLDownload(target).isPdf()) {
+                    // TODO: check title inside pdf + length?
+                    // TODO: report error function needed?! query -> result
+                    LOGGER.info("Fulltext PDF found @ Google: " + target);


Is this log statement really needed? If you do a lot of downloads it kind of clutters up the log imho...

stefan-kolb · 2018-03-28T12:12:07Z

@lenhard The SORT messages should not be there anymore. Do you have an old state of the PR?! Thanks for your review, will work that in.

lenhard · 2018-03-28T12:18:30Z

@stefan-kolb Yes, sorry, my bad. Pulling from master doesn't update the branches you had checked out before...

Tested again, the message is gone. It's also very nice how you can switch between entries during the download and the pdf is linked to the entry without killing the entry editor :)

tobiasdiez

In general the code looks good to me and is a nice improvement. Nonetheless, I have a few remarks and suggestions...

tobiasdiez · 2018-03-28T13:02:53Z

src/main/java/org/jabref/logic/importer/FetcherResult.java

+import org.jabref.logic.importer.fetcher.TrustLevel;
+
+public final class FetcherResult {
+    public final TrustLevel trust;


Hide fields and make them accessible via getter?

Ok, will do so.
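
A minimal sketch of the encapsulated variant. To stay self-contained it uses a nested stand-in enum and `java.net.URI` instead of the `URL` type from the diff; the class and getter names are assumptions, not the PR's final code.

```java
import java.net.URI;
import java.util.Objects;

final class FetcherResultSketch {
    // Stand-in for org.jabref.logic.importer.fetcher.TrustLevel.
    enum TrustLevel { SOURCE, PUBLISHER, PREPRINT, META_SEARCH, UNKNOWN }

    private final TrustLevel trust;
    private final URI source;

    FetcherResultSketch(TrustLevel trust, URI source) {
        // Rejecting null here means callers never need a separate null check.
        this.trust = Objects.requireNonNull(trust);
        this.source = Objects.requireNonNull(source);
    }

    TrustLevel getTrust() {
        return trust;
    }

    URI getSource() {
        return source;
    }
}
```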

tobiasdiez · 2018-03-28T13:04:04Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

+                    .map(FulltextFetchers::waitForResult)
+                    .filter(Optional::isPresent)
+                    .map(Optional::get)
+                    .filter(res -> Objects.nonNull(res.source))


In my opinion, the fetcher should make sure that the URL returned is always non-null. Otherwise the usage of Optional is superfluous.

tobiasdiez · 2018-03-28T13:07:31Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

+                try {
+                    Optional<URL> result = fetcher.findFullText(entry);
+
+                    if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) {


Do we really need to check that the URL returned is a PDF? This is an additional request. I think we should just trust the fetcher that it found an appropriate document.

This is the single point where this is checked. Otherwise every fetcher has to do this logic. It is vital so we don't get wrong fulltexts.

Yes, I was wondering why this check is necessary at all. If the fetcher determines that a given URL contains the right full-text (say because the API tells it so), then why do we need to check that it is a PDF?

Because sometimes it is an HTML site or the access is blocked for the users, resulting in something other than the fulltext 😄
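
To illustrate the kind of check being discussed, here is a hedged sketch that decides from a Content-Type header value whether a response is a PDF. How `URLDownload.isPdf()` is actually implemented is not shown in this thread; the helper below and its name are assumptions.

```java
import java.util.Locale;

final class PdfCheckSketch {
    // Returns true only if the media type of the header value is application/pdf.
    static boolean isPdfContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize the case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }
}
```

An HTML landing page or a login wall would typically report `text/html` and be rejected by such a check.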

tobiasdiez · 2018-03-28T13:08:11Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

+                    Optional<URL> result = fetcher.findFullText(entry);
+
+                    if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) {
+                        return Optional.of(new FetcherResult(fetcher.getTrustLevel(), result.get()));


Using Optional.map may be a bit easier to understand.

What kind of code should replace this?

I was thinking about something like

return fetcher.findFullText(entry)
              .filter(isPdf())
              .map(url -> new FetcherResult(fetcher.getTrustLevel(), url));
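
A self-contained sketch of that `filter`/`map` pipeline. The `Result` class, the `toResult` helper, and the use of `URI` with a caller-supplied predicate are stand-ins chosen for illustration, not JabRef types.

```java
import java.net.URI;
import java.util.Optional;
import java.util.function.Predicate;

final class OptionalPipelineSketch {
    // Hypothetical result type pairing a trust label with the found location.
    static final class Result {
        final String trust;
        final URI source;

        Result(String trust, URI source) {
            this.trust = trust;
            this.source = source;
        }
    }

    // Keeps the location only if it passes the PDF check, then wraps it in a Result.
    static Optional<Result> toResult(Optional<URI> found, Predicate<URI> isPdf, String trust) {
        return found
                .filter(isPdf)
                .map(url -> new Result(trust, url));
    }
}
```

An empty input and a rejected URL both yield `Optional.empty()`, so the explicit `isPresent()`/`get()` pair is no longer needed.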

tobiasdiez · 2018-03-28T13:09:47Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

+            findDoiForEntry(clonedEntry);
+        }
+
+        ExecutorService executor = Executors.newCachedThreadPool();


We have the class JabRefExecutorService that manages executors. Reuse it here?

tobiasdiez · 2018-03-28T13:12:05Z

src/test/java/org/jabref/logic/importer/FulltextFetchersTest.java

+    public void higherTrustLevelWins() throws MalformedURLException {
+        final URL lowUrl = new URL("http://docs.oasis-open.org/opencsa/sca-bpel/sca-bpel-1.1-spec-cd-01.pdf");
+        final URL highUrl = new URL("http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.pdf");
+        FulltextFetcher finderHigh = new FulltextFetcher() {


I find mocking with Mockito slightly better: easier to read and less code.

tobiasdiez · 2018-03-28T13:15:21Z

src/test/java/org/jabref/logic/importer/WebFetchersTest.java

        List<FulltextFetcher> fullTextFetchers = WebFetchers.getFullTextFetchers(importFormatPreferences);

-        Set<Class<? extends FulltextFetcher>> expected = reflections.getSubTypesOf(FulltextFetcher.class);
+        Set<Class<? extends FulltextFetcher>> expected = Stream.of(


The test was written as a reminder for when somebody implements a new fetcher but forgets to add it to getFullTextFetchers. This purpose is defeated if the list of fetchers is specified explicitly here. What was the reason for the change?

The problem was that Reflections then also finds the anonymous classes from the test in the previous comment as subclasses. Maybe mocking them changes this behavior!

lenhard · 2018-03-28T13:37:43Z

src/main/java/org/jabref/logic/importer/FulltextFetchers.java

+                .collect(Collectors.toList());
+    }
+
+    private void shutdownAndAwaitTermination(ExecutorService pool) {


Please remove this now unused private method.

stefan-kolb · 2018-03-28T14:53:12Z

Ok guys, I think all of your suggestions are there now 😄

stefan-kolb added 2 commits March 22, 2018 15:11

Parallel fetchers and first wins

9613438

Trust level implementation #3881

1305a30

stefan-kolb mentioned this pull request Mar 23, 2018

Concurrent fetchers #3881

Closed

stefan-kolb and others added 2 commits March 27, 2018 11:06

Merge branch 'master' into priority-fetcher

de9ccf0

Fix ordering

f31c4b4

stefan-kolb changed the title ~~[WIP] Priority fetcher~~ Priority fetcher Mar 27, 2018

Add tests

f17bfef

stefan-kolb requested a review from lenhard March 27, 2018 13:04

Code style

f5dd114

stefan-kolb added the status: ready-for-review label Mar 28, 2018

stefan-kolb added 4 commits March 28, 2018 09:36

Trust levels

67c39bd

Google refactoring

c3f8066

Syntax error

198b439

Reduce calls by one as mimeType is already known for fulltext as PDF #…

8821531

…3879

stefan-kolb requested review from Siedlerchr and tobiasdiez March 28, 2018 09:44

stefan-kolb added 3 commits March 28, 2018 12:07

Fix test

334719c

Unued imports

7293aba

Remove test

6d84a96

Siedlerchr reviewed Mar 28, 2018

Siedlerchr approved these changes Mar 28, 2018

Refactoring

7eb73c4

lenhard reviewed Mar 28, 2018

stefan-kolb added 3 commits March 28, 2018 14:25

Feedback

c4fec0f

Graceful shutdown and force shutdown for non-terminating tasks

f59a3c6

60 seconds

27fa0e8

tobiasdiez requested changes Mar 28, 2018

lenhard suggested changes Mar 28, 2018

stefan-kolb added 7 commits March 28, 2018 16:26

Revert test

189ca1f

Add Getters

8b0a5e8

Mock tests

cfd9d2c

Refactor to lambda

c89688e

Revert "60 seconds"

15b871c

This reverts commit 27fa0e8.

Revert "Graceful shutdown and force shutdown for non-terminating tasks"

806d060

This reverts commit f59a3c6.

Remove unused method

1888c33

lenhard approved these changes Mar 28, 2018

tobiasdiez merged commit 40a007a into master Mar 28, 2018

tobiasdiez deleted the priority-fetcher branch March 28, 2018 15:37

stefan-kolb mentioned this pull request Oct 7, 2020

Make the DOI Resolution Fetcher return nothing when the DOI leads to a host for which a tailored fetcher exists #6937

Closed

5 tasks
