Skip to content

Conversation

@stefan-kolb
Copy link
Member

@stefan-kolb stefan-kolb commented Mar 23, 2018

  • all fetchers are run in parallel now
  • the result with the highest priority wins
  • InterruptedException | ExecutionException | CancellationException ignore is more critical; not sure if the code behaves as we want it everytime

Trust Levels:

  • SOURCE (highest)
  • PUBLISHER
  • PREPRINT
  • META_SEARCH
  • UNKNOWN

Current trust levels:

  • DOI: SOURCE
  • ScienceDircect: Publisher
  • Springer: Publisher
  • ACS: Publisher
  • IEEE: Publisher
  • Google Scholar: META_SEARCH
  • Arxiv: PREPRINT
  • OpenAccessDOI: META_SEARCH

@stefan-kolb stefan-kolb mentioned this pull request Mar 23, 2018
@stefan-kolb stefan-kolb changed the title [WIP] Priority fetcher Priority fetcher Mar 27, 2018
@stefan-kolb stefan-kolb requested a review from lenhard March 27, 2018 13:04
@stefan-kolb stefan-kolb added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Mar 28, 2018
tmp.deleteOnExit();

URLDownload udl = new URLDownload(url);
final File tempFile = File.createTempFile("jabref_download", "tmp");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Files.createTempFile

public class FulltextFetchers {
private static final Logger LOGGER = LoggerFactory.getLogger(FulltextFetchers.class);

private static final int FETCHER_TIMEOUT = 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the unit of time we wait. Eg. 10 seconds/minutes or hours?

if (link.first() != null) {
String target = link.first().attr("href");
// link present?
if (!"".equals(target) && new URLDownload(target).isPdf()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isEmpty?

Copy link
Member

@Siedlerchr Siedlerchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor things, but overall looks good

Copy link
Member

@lenhard lenhard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! I have tested it locally with a number of entries and it always worked. Very handy. I also think the scheme for prioritization is fine.

I have a few suggestions in the code that you can think about, but they are not blockers.

When executing this in Intellij, I get console messages of the form:
SORT: 0 https://someurl.com/somepdf.pdf I didn't find where these come from (also didn't search too much). Anyway, could you get rid of those?

}

return result.stream()
.map(FulltextFetchers::waitForResult)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need this mapping? All futures that have not completed when the time runs out in the call to invokeAll above are cancelled, see https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#invokeAll(java.util.Collection,%20long,%20java.util.concurrent.TimeUnit)

So there is probably no point in waiting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, good point 👍

findDoiForEntry(clonedEntry);
}

ExecutorService executor = Executors.newCachedThreadPool();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea: Building a thread pool consumes resources and you are doing this here for every method call. Maybe it would make sense to reuse the JabRefExecutorService.executeAll() in this method? JabRef already takes care of shutting down this thread pool, so you could avoid all the lifecycle overhead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the JabRefExecutor service now. But I think the shutdown there is not correct and might never finish if the tasks itsel do not finish. Currently there is only a shutdown and not a shutdownNow call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only difference between shutdown and shutdownNow is that in the first case tasks are still allowed to complete and in the second they are cancelled. Tasks that never terminate will never terminate in both cases.


if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) {
return result;
private void shutdownAndAwaitTermination(ExecutorService pool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly the "usual" way to shutdown a thread pool, but a fair compromise for keeping the application alive. Please note that this does not guarantee a shutdown and might result in memory leaks in weird and highly unlikely circumstances. I would prefer outsourcing this to the JabRefExecutorServices (see comment above).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this the usual way? It is directl yfrom the Java docs 😄

if (!"".equals(target) && new URLDownload(target).isPdf()) {
// TODO: check title inside pdf + length?
// TODO: report error function needed?! query -> result
LOGGER.info("Fulltext PDF found @ Google: " + target);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this log statement really needed? If you do a lot of downloads it kind of clutters up the log imho...

@stefan-kolb
Copy link
Member Author

@lenhard The SORT messages should not be there anymore. Do you have an old state of the PR?! Thanks for your review, will work that in.

@lenhard
Copy link
Member

lenhard commented Mar 28, 2018

@stefan-kolb Yes, sorry, my bad. Pulling from master doesn't update the branches you had checked out before...

Tested again, the message is gone. It's also very nice how you can switch between entries during the download and the pdf is linked to the entry without killing the entry editor :)

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the code looks good to me and is a nice improvement. Nonetheless, I have a few remarks and suggestions...

import org.jabref.logic.importer.fetcher.TrustLevel;

public final class FetcherResult {
public final TrustLevel trust;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide fields and make them accessible via getter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do so.

.map(FulltextFetchers::waitForResult)
.filter(Optional::isPresent)
.map(Optional::get)
.filter(res -> Objects.nonNull(res.source))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the fetcher should make sure that the URL returned is always non-null. Otherwise the usage of Optional is superfluous.

try {
Optional<URL> result = fetcher.findFullText(entry);

if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to check that the URL returned is a PDF? This is an additional request. I think, we should just trust the fetcher that it found an appropriate document.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the single point where this is checked. Otherwise every fetcher has to do this logic. It is vital so we don't get wrong fulltexts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was wondering why this check is necessary at all. If the fetcher determines that a given URL contains the right full-text (say because the API tell it so), then why do we need to check that it is a PDF?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because sometimes it is an HTML site or the access is blocked for the users, resulting in something else than the fulltext 😄

Optional<URL> result = fetcher.findFullText(entry);

if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) {
return Optional.of(new FetcherResult(fetcher.getTrustLevel(), result.get()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Optional.map may be a bit easier to understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of code should replace this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something like

return fetcher.findFullText(entry)
          .filter(isPdf())
          .map(url -> new FetcherResult(fetcher.getTrust(), url));

findDoiForEntry(clonedEntry);
}

ExecutorService executor = Executors.newCachedThreadPool();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the class JabRefExecutorService that manages executors. Reuse it here?

public void higherTrustLevelWins() throws MalformedURLException {
final URL lowUrl = new URL("http://docs.oasis-open.org/opencsa/sca-bpel/sca-bpel-1.1-spec-cd-01.pdf");
final URL highUrl = new URL("http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.pdf");
FulltextFetcher finderHigh = new FulltextFetcher() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find mocking using mockito is slightly better / easier to read / less code.

List<FulltextFetcher> fullTextFetchers = WebFetchers.getFullTextFetchers(importFormatPreferences);

Set<Class<? extends FulltextFetcher>> expected = reflections.getSubTypesOf(FulltextFetcher.class);
Set<Class<? extends FulltextFetcher>> expected = Stream.of(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was written to remind when somebody implements a new fetcher but forgets to add it in getFullTextFetchers. This purpose is defeated if the list of fetchers is specified explicitly here. What was the reason for the change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was that the reflections find the classes from the comment before as sub classes then. Maybe mocking them does change this behavior!

.collect(Collectors.toList());
}

private void shutdownAndAwaitTermination(ExecutorService pool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this now unused private method.

@stefan-kolb
Copy link
Member Author

Ok guys, I think all of your suggestions are there now 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants