-
-
Notifications
You must be signed in to change notification settings - Fork 3
Config passing serval #45 #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…d_options, only infer when no training data exists
This should be updated to 3.11 - to match what poetry is expecting. |
Update to python 3.11 |
There should be a try-except here with an intelligent error about how the build_options were not parsable. |
Will this default really work? What would happen if you parse it? Shouldn't it be "{}"? |
if args["build_options"] == "": |
We probably need a minimum size of target and source files. We should check for it and throw if it does not exist. @ddaspit - is it 500 lines? |
Previously, johnml1135 (John Lambert) wrote…
That is, a minimum number of aligned parallel_corpus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @mshannon-sil)
machine/jobs/nmt_engine_build_job.py
line 25 at r1 (raw file):
self._nmt_model_factory.init() if self._shared_file_service.exists_source_corpus() and self._shared_file_service.exists_target_corpus():
Should we be also checking for a non-zero size? I am unsure of the logic of Machine - will it create a target file if none exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/jobs/build_nmt_engine.py
line 74 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Will this default really work? What would happen if you parse it? Shouldn't it be "{}"?
That default sets it to an empty dictionary. If build_options is an empty dictionary, then the check if args["build_options"]:
evaluates to False and it's not parsed, leaving settings with the default values. However, I can change it to work with a default of "{}" if there's a benefit in doing it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)
dockerfile
line 2 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This should be updated to 3.11 - to match what poetry is expecting.
Python 3.8 is fine. Machine.py supports 3.8-3.11.
pyproject.toml
line 60 at r1 (raw file):
networkx = "^2.6.3" charset-normalizer = "^2.1.1" setuptools = "^68.2.2"
What is setuptools needed for?
machine/jobs/build_nmt_engine.py
line 74 at r1 (raw file):
Previously, mshannon-sil wrote…
That default sets it to an empty dictionary. If build_options is an empty dictionary, then the check
if args["build_options"]:
evaluates to False and it's not parsed, leaving settings with the default values. However, I can change it to work with a default of "{}" if there's a benefit in doing it that way.
I would make the default an empty string or "{}". That way it is always the same type.
machine/jobs/nmt_engine_build_job.py
line 25 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Should we be also checking for a non-zero size? I am unsure of the logic of Machine - will it create a target file if none exist?
Yes, that is correct. The source and target files will always exist. You need to check if the parallel corpus is empty. You can call parallel_corpus.count(include_empty=False)
to get the training corpus size.
machine/jobs/nmt_engine_build_job.py
line 29 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
That is, a minimum number of aligned parallel_corpus.
I wouldn't impose a minimum. We should leave it up to the calling application to decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)
pyproject.toml
line 60 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
What is setuptools needed for?
I upgraded poetry and added setuptools so that it would pass the github actions test. It seemed kind of strange to me that it needed setuptools in poetry, so I tried removing it, but then it started failing again, so I kept it in. You can see all this in my commit history for the branch if you want to take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/jobs/nmt_engine_build_job.py
line 25 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Yes, that is correct. The source and target files will always exist. You need to check if the parallel corpus is empty. You can call
parallel_corpus.count(include_empty=False)
to get the training corpus size.
To be clear, if the user does not include training files and Machine generates them, does Machine always generate empty files for both the source and the target training files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)
pyproject.toml
line 60 at r1 (raw file):
Previously, mshannon-sil wrote…
I upgraded poetry and added setuptools so that it would pass the github actions test. It seemed kind of strange to me that it needed setuptools in poetry, so I tried removing it, but then it started failing again, so I kept it in. You can see all this in my commit history for the branch if you want to take a look.
I removed setuptools from the pyproject.toml file. Machine.py already had a dependency on setuptools, it just needed to be updated to the latest version in the lock file.
machine/jobs/nmt_engine_build_job.py
line 25 at r1 (raw file):
Previously, mshannon-sil wrote…
To be clear, if the user does not include training files and Machine generates them, does Machine always generate empty files for both the source and the target training files?
In that case, Machine will generate empty source and target files. It is still possible for a calling application to provide source and target corpus files that are non-empty and the parallel corpus to be empty. This can occur if the source and target files do not have any aligned segments. That is why you should use parallel_corpus.count(include_empty=False)
to determine if the parallel corpus is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @mshannon-sil)
…valid json in build_options, fix training data check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 9 of 12 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)
dockerfile
line 2 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Python 3.8 is fine. Machine.py supports 3.8-3.11.
Okay, I've left it at 3.8 for now, but I can upgrade it to 3.11 if we'd like.
pyproject.toml
line 60 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I removed setuptools from the pyproject.toml file. Machine.py already had a dependency on setuptools, it just needed to be updated to the latest version in the lock file.
Thanks, glad you were able to get it working without setuptools in the toml file!
.devcontainer/dockerfile
line 2 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Update to python 3.11
See other comment on python versions
machine/jobs/build_nmt_engine.py
line 40 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
if args["build_options"] == "":
args["build_options"] = "{}"
If build options is passed in as an empty string, it'll throw a json error, which we are now relaying to the user. So I don't think this check is needed any longer.
machine/jobs/build_nmt_engine.py
line 41 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
There should be a try-except here with an intelligent error about how the build_options were not parsable.
Done. Let me know if you'd like any tweaks to the error message.
machine/jobs/build_nmt_engine.py
line 74 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would make the default an empty string or "{}". That way it is always the same type.
Done.
machine/jobs/nmt_engine_build_job.py
line 25 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
In that case, Machine will generate empty source and target files. It is still possible for a calling application to provide source and target corpus files that are non-empty and the parallel corpus to be empty. This can occur if the source and target files do not have any aligned segments. That is why you should use
parallel_corpus.count(include_empty=False)
to determine if the parallel corpus is empty.
Done.
One more thing - there should probably be something put in the log for the else case like "no matching entries in the source and target corpus - skipping training". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r3, all commit messages.
Reviewable status: 11 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @mshannon-sil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 10 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/jobs/nmt_engine_build_job.py
line 30 at r3 (raw file):
Previously, johnml1135 (John Lambert) wrote…
One more thing - there should probably be something put in the log for the else case like "no matching entries in the source and target corpus - skipping training".
Done.
Previously, mshannon-sil wrote…
It's fine at 3.8 for now. The TOML file allows it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
dockerfile
line 2 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
It's fine at 3.8 for now. The TOML file allows it.
Talking with John, it would be helpful to upgrade python to the latest version since Python 3.11 is much faster than Python 3.8. However, the way the devcontainer dockerfile is set up right now, any python above 3.9 doesn't work with it, and after wrangling with it for over an hour today I think it would be more efficient to work on upgrading it after I get done what I need to for MVP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really need the exists_source_corpus
and exists_target_corpus
methods anymore, but I don't have a problem with leaving them in if you want.
Reviewed 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
These updates are for creating the
build_options
argument, movingmax_steps
insidebuild_options
, and only doing inference when the user does not provide training data.This change is