Config passing serval #45 #32

mshannon-sil · 2023-09-30T02:12:46Z

These updates are for creating the build_options argument, moving max_steps inside build_options, and only doing inference when the user does not provide training data.

This change is

…d_options, only infer when no training data exists

johnml1135 · 2023-10-02T13:37:06Z

dockerfile line 2 at r1 (raw file):

#compatability with Tensorflow 2.6.0 as per https://www.tensorflow.org/install/source#gpu
ARG PYTHON_VERSION=3.8

This should be updated to 3.11 - to match what poetry is expecting.

johnml1135 · 2023-10-02T13:38:29Z

.devcontainer/dockerfile line 2 at r1 (raw file):

#compatability with Tensorflow 2.6.0 as per https://www.tensorflow.org/install/source#gpu
ARG PYTHON_VERSION=3.8

Update to python 3.11

johnml1135 · 2023-10-02T13:40:37Z

machine/jobs/build_nmt_engine.py line 41 at r1 (raw file):

        SETTINGS.update(args)
        if args["build_options"]:
            SETTINGS.build_options = json.loads(args["build_options"])

There should be a try-except here with an intelligent error about how the build_options were not parsable.

johnml1135 · 2023-10-02T13:42:21Z

machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

    parser.add_argument("--trg-lang", required=True, type=str, help="Target language tag")
    parser.add_argument("--clearml", default=False, action="store_true", help="Initializes a ClearML task")
    parser.add_argument("--build-options", default={}, help="Build configurations")

Will this default really work? What would happen if you parse it? Shouldn't it be "{}"?

johnml1135 · 2023-10-02T13:43:01Z

machine/jobs/build_nmt_engine.py line 40 at r1 (raw file):

        SETTINGS.update(args)
        if args["build_options"]:

if args["build_options"] == "":
args["build_options"] = "{}"

johnml1135 · 2023-10-02T13:45:54Z

machine/jobs/nmt_engine_build_job.py line 29 at r1 (raw file):

            source_corpus = self._shared_file_service.create_source_corpus()
            target_corpus = self._shared_file_service.create_target_corpus()
            parallel_corpus = source_corpus.align_rows(target_corpus)

We probably need a minimum size of target and source files. We should check for it and throw if it does not exist. @ddaspit - is it 500 lines?

johnml1135 · 2023-10-02T13:46:58Z

machine/jobs/nmt_engine_build_job.py line 29 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

We probably need a minimum size of target and source files. We should check for it and throw if it does not exist. @ddaspit - is it 500 lines?

That is, a minimum number of aligned parallel_corpus.

johnml1135

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @mshannon-sil)

machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

        self._nmt_model_factory.init()

        if self._shared_file_service.exists_source_corpus() and self._shared_file_service.exists_target_corpus():

Should we be also checking for a non-zero size? I am unsure of the logic of Machine - will it create a target file if none exist?

mshannon-sil

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)

machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Will this default really work? What would happen if you parse it? Shouldn't it be "{}"?

That default sets it to an empty dictionary. If build_options is an empty dictionary, then the check if args["build_options"]: evaluates to False and it's not parsed, leaving settings with the default values. However, I can change it to work with a default of "{}" if there's a benefit in doing it that way.

ddaspit

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)

dockerfile line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This should be updated to 3.11 - to match what poetry is expecting.

Python 3.8 is fine. Machine.py supports 3.8-3.11.

pyproject.toml line 60 at r1 (raw file):

networkx = "^2.6.3"
charset-normalizer = "^2.1.1"
setuptools = "^68.2.2"

What is setuptools needed for?

machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

Previously, mshannon-sil wrote…

That default sets it to an empty dictionary. If build_options is an empty dictionary, then the check if args["build_options"]: evaluates to False and it's not parsed, leaving settings with the default values. However, I can change it to work with a default of "{}" if there's a benefit in doing it that way.

I would make the default an empty string or "{}". That way it is always the same type.

machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Should we be also checking for a non-zero size? I am unsure of the logic of Machine - will it create a target file if none exist?

Yes, that is correct. The source and target files will always exist. You need to check if the parallel corpus is empty. You can call parallel_corpus.count(include_empty=False) to get the training corpus size.

machine/jobs/nmt_engine_build_job.py line 29 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

That is, a minimum number of aligned parallel_corpus.

I wouldn't impose a minimum. We should leave it up to the calling application to decide.

mshannon-sil

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)

pyproject.toml line 60 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is setuptools needed for?

I upgraded poetry and added setuptools so that it would pass the github actions test. It seemed kind of strange to me that it needed setuptools in poetry, so I tried removing it, but then it started failing again, so I kept it in. You can see all this in my commit history for the branch if you want to take a look.

mshannon-sil

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)

machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, that is correct. The source and target files will always exist. You need to check if the parallel corpus is empty. You can call parallel_corpus.count(include_empty=False) to get the training corpus size.

To be clear, if the user does not include training files and Machine generates them, does Machine always generate empty files for both the source and the target training files?

ddaspit

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)

pyproject.toml line 60 at r1 (raw file):

Previously, mshannon-sil wrote…

I upgraded poetry and added setuptools so that it would pass the github actions test. It seemed kind of strange to me that it needed setuptools in poetry, so I tried removing it, but then it started failing again, so I kept it in. You can see all this in my commit history for the branch if you want to take a look.

I removed setuptools from the pyproject.toml file. Machine.py already had a dependency on setuptools, it just needed to be updated to the latest version in the lock file.

machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, mshannon-sil wrote…

To be clear, if the user does not include training files and Machine generates them, does Machine always generate empty files for both the source and the target training files?

In that case, Machine will generate empty source and target files. It is still possible for a calling application to provide source and target corpus files that are non-empty and the parallel corpus to be empty. This can occur if the source and target files do not have any aligned segments. That is why you should use parallel_corpus.count(include_empty=False) to determine if the parallel corpus is empty.

johnml1135

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @mshannon-sil)

…valid json in build_options, fix training data check

mshannon-sil

Reviewable status: 9 of 12 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)

dockerfile line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Python 3.8 is fine. Machine.py supports 3.8-3.11.

Okay, I've left it at 3.8 for now, but I can upgrade it to 3.11 if we'd like.

pyproject.toml line 60 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I removed setuptools from the pyproject.toml file. Machine.py already had a dependency on setuptools, it just needed to be updated to the latest version in the lock file.

Thanks, glad you were able to get it working without setuptools in the toml file!

.devcontainer/dockerfile line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Update to python 3.11

See other comment on python versions

machine/jobs/build_nmt_engine.py line 40 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

if args["build_options"] == "":
args["build_options"] = "{}"

If build options is passed in as an empty string, it'll throw a json error, which we are now relaying to the user. So I don't think this check is needed any longer.

machine/jobs/build_nmt_engine.py line 41 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There should be a try-except here with an intelligent error about how the build_options were not parsable.

Done. Let me know if you'd like any tweaks to the error message.

machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would make the default an empty string or "{}". That way it is always the same type.

Done.

machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

In that case, Machine will generate empty source and target files. It is still possible for a calling application to provide source and target corpus files that are non-empty and the parallel corpus to be empty. This can occur if the source and target files do not have any aligned segments. That is why you should use parallel_corpus.count(include_empty=False) to determine if the parallel corpus is empty.

Done.

johnml1135 · 2023-10-03T16:39:09Z

machine/jobs/nmt_engine_build_job.py line 30 at r3 (raw file):

        parallel_corpus = source_corpus.align_rows(target_corpus)

        if parallel_corpus.count(include_empty=False):

One more thing - there should probably be something put in the log for the else case like "no matching entries in the source and target corpus - skipping training".

johnml1135

Reviewed 2 of 3 files at r3, all commit messages.
Reviewable status: 11 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @mshannon-sil)

mshannon-sil

Reviewable status: 10 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)

machine/jobs/nmt_engine_build_job.py line 30 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

One more thing - there should probably be something put in the log for the else case like "no matching entries in the source and target corpus - skipping training".

Done.

johnml1135 · 2023-10-03T18:08:44Z

dockerfile line 2 at r1 (raw file):

Previously, mshannon-sil wrote…

Okay, I've left it at 3.8 for now, but I can upgrade it to 3.11 if we'd like.

It's fine at 3.8 for now. The TOML file allows it.

johnml1135

Reviewed 1 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)

mshannon-sil

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

dockerfile line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It's fine at 3.8 for now. The TOML file allows it.

Talking with John, it would be helpful to upgrade python to the latest version since Python 3.11 is much faster than Python 3.8. However, the way the devcontainer dockerfile is set up right now, any python above 3.9 doesn't work with it, and after wrangling with it for over an hour today I think it would be more efficient to work on upgrading it after I get done what I need to for MVP.

ddaspit

We don't really need the exists_source_corpus and exists_target_corpus methods anymore, but I don't have a problem with leaving them in if you want.

Reviewed 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

johnml1135

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

mshannon-sil added 11 commits September 26, 2023 16:21

add build_options argument, move max_steps and model_type inside buil…

00a5511

…d_options, only infer when no training data exists

merge changes from main

03c9164

update poetry.lock, launch.json

b0d8e85

try adding setuptools to poetry

6e513f3

run dep installation twice as work around

4715fb5

add setuptools to pyproject.toml

3253143

update poetry to 1.6.1

41b0bc5

remove setup tools from pyproject.toml

50fd4ae

remove setuptools from lock file

be7e697

revert to last passing commit

c9a6020

move model_type back out of build_options

24f1016

mshannon-sil added the enhancement label Sep 30, 2023

mshannon-sil added this to the 1.1 Mother Tongue MVP milestone Sep 30, 2023

mshannon-sil requested review from ddaspit and johnml1135 September 30, 2023 02:12

mshannon-sil self-assigned this Sep 30, 2023

johnml1135 requested changes Oct 2, 2023

View reviewed changes

mshannon-sil commented Oct 2, 2023

View reviewed changes

ddaspit reviewed Oct 2, 2023

View reviewed changes

mshannon-sil commented Oct 2, 2023

View reviewed changes

Remove setuptools from pyproject.toml

8a5ed61

ddaspit reviewed Oct 2, 2023

View reviewed changes

johnml1135 reviewed Oct 3, 2023

View reviewed changes

change default build_options to str type, custom error message for in…

b0fc2f9

…valid json in build_options, fix training data check

mshannon-sil commented Oct 3, 2023

View reviewed changes

johnml1135 reviewed Oct 3, 2023

View reviewed changes

log message when skipping training

f2839cb

mshannon-sil commented Oct 3, 2023

View reviewed changes

johnml1135 approved these changes Oct 3, 2023

View reviewed changes

mshannon-sil commented Oct 3, 2023

View reviewed changes

ddaspit approved these changes Oct 4, 2023

View reviewed changes

johnml1135 approved these changes Oct 4, 2023

View reviewed changes

johnml1135 merged commit 70ec8ed into main Oct 4, 2023

ddaspit deleted the config_passing_serval_#45 branch October 4, 2023 15:48

Uh oh!

Config passing serval #45 #32

Config passing serval #45 #32

Uh oh!

Conversation

mshannon-sil commented Sep 30, 2023 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

johnml1135 commented Oct 2, 2023

Uh oh!

johnml1135 commented Oct 2, 2023

Uh oh!

johnml1135 commented Oct 2, 2023

Uh oh!

johnml1135 commented Oct 2, 2023

Uh oh!

johnml1135 commented Oct 2, 2023

Uh oh!

johnml1135 commented Oct 2, 2023

Uh oh!

johnml1135 commented Oct 2, 2023

Uh oh!

johnml1135 left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

johnml1135 left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil left a comment

Choose a reason for hiding this comment

Uh oh!

johnml1135 commented Oct 3, 2023

Uh oh!

johnml1135 left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil left a comment

Choose a reason for hiding this comment

Uh oh!

johnml1135 commented Oct 3, 2023

Uh oh!

johnml1135 left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil left a comment

Choose a reason for hiding this comment

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

johnml1135 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mshannon-sil commented Sep 30, 2023 •

edited by ddaspit

Loading