-
-
Notifications
You must be signed in to change notification settings - Fork 3k
Priority fetcher #3882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Priority fetcher #3882
Conversation
| tmp.deleteOnExit(); | ||
|
|
||
| URLDownload udl = new URLDownload(url); | ||
| final File tempFile = File.createTempFile("jabref_download", "tmp"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Files.createTempFile
| public class FulltextFetchers { | ||
| private static final Logger LOGGER = LoggerFactory.getLogger(FulltextFetchers.class); | ||
|
|
||
| private static final int FETCHER_TIMEOUT = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add the unit of time we wait. Eg. 10 seconds/minutes or hours?
| if (link.first() != null) { | ||
| String target = link.first().attr("href"); | ||
| // link present? | ||
| if (!"".equals(target) && new URLDownload(target).isPdf()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isEmpty?
Siedlerchr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor things, but overall looks good
lenhard
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR! I have tested it locally with a number of entries and it always worked. Very handy. I also think the scheme for prioritization is fine.
I have a few suggestions in the code that you can think about, but they are not blockers.
When executing this in Intellij, I get console messages of the form:
SORT: 0 https://someurl.com/somepdf.pdf I didn't find where these come from (also didn't search too much). Anyway, could you get rid of those?
| } | ||
|
|
||
| return result.stream() | ||
| .map(FulltextFetchers::waitForResult) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need this mapping? All futures that have not completed when the time runs out in the call to invokeAll above are cancelled, see https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#invokeAll(java.util.Collection,%20long,%20java.util.concurrent.TimeUnit)
So there is probably no point in waiting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, good point 👍
| findDoiForEntry(clonedEntry); | ||
| } | ||
|
|
||
| ExecutorService executor = Executors.newCachedThreadPool(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea: Building a thread pool consumes resources and you are doing this here for every method call. Maybe it would make sense to reuse the JabRefExecutorService.executeAll() in this method? JabRef already takes care of shutting down this thread pool, so you could avoid all the lifecycle overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the JabRefExecutor service now. But I think the shutdown there is not correct and might never finish if the tasks itsel do not finish. Currently there is only a shutdown and not a shutdownNow call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference between shutdown and shutdownNow is that in the first case tasks are still allowed to complete and in the second they are cancelled. Tasks that never terminate will never terminate in both cases.
|
|
||
| if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) { | ||
| return result; | ||
| private void shutdownAndAwaitTermination(ExecutorService pool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly the "usual" way to shutdown a thread pool, but a fair compromise for keeping the application alive. Please note that this does not guarantee a shutdown and might result in memory leaks in weird and highly unlikely circumstances. I would prefer outsourcing this to the JabRefExecutorServices (see comment above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this the usual way? It is directl yfrom the Java docs 😄
| if (!"".equals(target) && new URLDownload(target).isPdf()) { | ||
| // TODO: check title inside pdf + length? | ||
| // TODO: report error function needed?! query -> result | ||
| LOGGER.info("Fulltext PDF found @ Google: " + target); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this log statement really needed? If you do a lot of downloads it kind of clutters up the log imho...
|
@lenhard The SORT messages should not be there anymore. Do you have an old state of the PR?! Thanks for your review, will work that in. |
|
@stefan-kolb Yes, sorry, my bad. Pulling from master doesn't update the branches you had checked out before... Tested again, the message is gone. It's also very nice how you can switch between entries during the download and the pdf is linked to the entry without killing the entry editor :) |
tobiasdiez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general the code looks good to me and is a nice improvement. Nonetheless, I have a few remarks and suggestions...
| import org.jabref.logic.importer.fetcher.TrustLevel; | ||
|
|
||
| public final class FetcherResult { | ||
| public final TrustLevel trust; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hide fields and make them accessible via getter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will do so.
| .map(FulltextFetchers::waitForResult) | ||
| .filter(Optional::isPresent) | ||
| .map(Optional::get) | ||
| .filter(res -> Objects.nonNull(res.source)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the fetcher should make sure that the URL returned is always non-null. Otherwise the usage of Optional is superfluous.
| try { | ||
| Optional<URL> result = fetcher.findFullText(entry); | ||
|
|
||
| if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to check that the URL returned is a PDF? This is an additional request. I think, we should just trust the fetcher that it found an appropriate document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the single point where this is checked. Otherwise every fetcher has to do this logic. It is vital so we don't get wrong fulltexts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was wondering why this check is necessary at all. If the fetcher determines that a given URL contains the right full-text (say because the API tell it so), then why do we need to check that it is a PDF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because sometimes it is an HTML site or the access is blocked for the users, resulting in something else than the fulltext 😄
| Optional<URL> result = fetcher.findFullText(entry); | ||
|
|
||
| if (result.isPresent() && new URLDownload(result.get().toString()).isPdf()) { | ||
| return Optional.of(new FetcherResult(fetcher.getTrustLevel(), result.get())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Optional.map may be a bit easier to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of code should replace this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about something like
return fetcher.findFullText(entry)
.filter(isPdf())
.map(url -> new FetcherResult(fetcher.getTrust(), url));| findDoiForEntry(clonedEntry); | ||
| } | ||
|
|
||
| ExecutorService executor = Executors.newCachedThreadPool(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the class JabRefExecutorService that manages executors. Reuse it here?
| public void higherTrustLevelWins() throws MalformedURLException { | ||
| final URL lowUrl = new URL("http://docs.oasis-open.org/opencsa/sca-bpel/sca-bpel-1.1-spec-cd-01.pdf"); | ||
| final URL highUrl = new URL("http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.pdf"); | ||
| FulltextFetcher finderHigh = new FulltextFetcher() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find mocking using mockito is slightly better / easier to read / less code.
| List<FulltextFetcher> fullTextFetchers = WebFetchers.getFullTextFetchers(importFormatPreferences); | ||
|
|
||
| Set<Class<? extends FulltextFetcher>> expected = reflections.getSubTypesOf(FulltextFetcher.class); | ||
| Set<Class<? extends FulltextFetcher>> expected = Stream.of( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test was written to remind when somebody implements a new fetcher but forgets to add it in getFullTextFetchers. This purpose is defeated if the list of fetchers is specified explicitly here. What was the reason for the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem was that the reflections find the classes from the comment before as sub classes then. Maybe mocking them does change this behavior!
| .collect(Collectors.toList()); | ||
| } | ||
|
|
||
| private void shutdownAndAwaitTermination(ExecutorService pool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this now unused private method.
|
Ok guys, I think all of your suggestions are there now 😄 |
Trust Levels:
Current trust levels: