Skip to content

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Sep 30, 2023

These updates are for creating the build_options argument, moving max_steps inside build_options, and only doing inference when the user does not provide training data.


This change is Reviewable

@johnml1135
Copy link
Collaborator

dockerfile line 2 at r1 (raw file):

#compatability with Tensorflow 2.6.0 as per https://www.tensorflow.org/install/source#gpu
ARG PYTHON_VERSION=3.8

This should be updated to 3.11 - to match what poetry is expecting.

@johnml1135
Copy link
Collaborator

.devcontainer/dockerfile line 2 at r1 (raw file):

#compatability with Tensorflow 2.6.0 as per https://www.tensorflow.org/install/source#gpu
ARG PYTHON_VERSION=3.8

Update to python 3.11

@johnml1135
Copy link
Collaborator

machine/jobs/build_nmt_engine.py line 41 at r1 (raw file):

        SETTINGS.update(args)
        if args["build_options"]:
            SETTINGS.build_options = json.loads(args["build_options"])

There should be a try-except here with an intelligent error about how the build_options were not parsable.

@johnml1135
Copy link
Collaborator

machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

    parser.add_argument("--trg-lang", required=True, type=str, help="Target language tag")
    parser.add_argument("--clearml", default=False, action="store_true", help="Initializes a ClearML task")
    parser.add_argument("--build-options", default={}, help="Build configurations")

Will this default really work? What would happen if you parse it? Shouldn't it be "{}"?

@johnml1135
Copy link
Collaborator

machine/jobs/build_nmt_engine.py line 40 at r1 (raw file):

        SETTINGS.update(args)
        if args["build_options"]:

if args["build_options"] == "":
args["build_options"] = "{}"

@johnml1135
Copy link
Collaborator

machine/jobs/nmt_engine_build_job.py line 29 at r1 (raw file):

            source_corpus = self._shared_file_service.create_source_corpus()
            target_corpus = self._shared_file_service.create_target_corpus()
            parallel_corpus = source_corpus.align_rows(target_corpus)

We probably need a minimum size of target and source files. We should check for it and throw if it does not exist. @ddaspit - is it 500 lines?

@johnml1135
Copy link
Collaborator

machine/jobs/nmt_engine_build_job.py line 29 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

We probably need a minimum size of target and source files. We should check for it and throw if it does not exist. @ddaspit - is it 500 lines?

That is, a minimum number of aligned parallel_corpus.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @mshannon-sil)


machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

        self._nmt_model_factory.init()

        if self._shared_file_service.exists_source_corpus() and self._shared_file_service.exists_target_corpus():

Should we be also checking for a non-zero size? I am unsure of the logic of Machine - will it create a target file if none exist?

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Will this default really work? What would happen if you parse it? Shouldn't it be "{}"?

That default sets it to an empty dictionary. If build_options is an empty dictionary, then the check if args["build_options"]: evaluates to False and it's not parsed, leaving settings with the default values. However, I can change it to work with a default of "{}" if there's a benefit in doing it that way.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)


dockerfile line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This should be updated to 3.11 - to match what poetry is expecting.

Python 3.8 is fine. Machine.py supports 3.8-3.11.


pyproject.toml line 60 at r1 (raw file):

networkx = "^2.6.3"
charset-normalizer = "^2.1.1"
setuptools = "^68.2.2"

What is setuptools needed for?


machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

Previously, mshannon-sil wrote…

That default sets it to an empty dictionary. If build_options is an empty dictionary, then the check if args["build_options"]: evaluates to False and it's not parsed, leaving settings with the default values. However, I can change it to work with a default of "{}" if there's a benefit in doing it that way.

I would make the default an empty string or "{}". That way it is always the same type.


machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Should we be also checking for a non-zero size? I am unsure of the logic of Machine - will it create a target file if none exist?

Yes, that is correct. The source and target files will always exist. You need to check if the parallel corpus is empty. You can call parallel_corpus.count(include_empty=False) to get the training corpus size.


machine/jobs/nmt_engine_build_job.py line 29 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

That is, a minimum number of aligned parallel_corpus.

I wouldn't impose a minimum. We should leave it up to the calling application to decide.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)


pyproject.toml line 60 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is setuptools needed for?

I upgraded poetry and added setuptools so that it would pass the github actions test. It seemed kind of strange to me that it needed setuptools in poetry, so I tried removing it, but then it started failing again, so I kept it in. You can see all this in my commit history for the branch if you want to take a look.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, that is correct. The source and target files will always exist. You need to check if the parallel corpus is empty. You can call parallel_corpus.count(include_empty=False) to get the training corpus size.

To be clear, if the user does not include training files and Machine generates them, does Machine always generate empty files for both the source and the target training files?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)


pyproject.toml line 60 at r1 (raw file):

Previously, mshannon-sil wrote…

I upgraded poetry and added setuptools so that it would pass the github actions test. It seemed kind of strange to me that it needed setuptools in poetry, so I tried removing it, but then it started failing again, so I kept it in. You can see all this in my commit history for the branch if you want to take a look.

I removed setuptools from the pyproject.toml file. Machine.py already had a dependency on setuptools, it just needed to be updated to the latest version in the lock file.


machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, mshannon-sil wrote…

To be clear, if the user does not include training files and Machine generates them, does Machine always generate empty files for both the source and the target training files?

In that case, Machine will generate empty source and target files. It is still possible for a calling application to provide source and target corpus files that are non-empty and the parallel corpus to be empty. This can occur if the source and target files do not have any aligned segments. That is why you should use parallel_corpus.count(include_empty=False) to determine if the parallel corpus is empty.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @mshannon-sil)

…valid json in build_options, fix training data check
Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 12 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)


dockerfile line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Python 3.8 is fine. Machine.py supports 3.8-3.11.

Okay, I've left it at 3.8 for now, but I can upgrade it to 3.11 if we'd like.


pyproject.toml line 60 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I removed setuptools from the pyproject.toml file. Machine.py already had a dependency on setuptools, it just needed to be updated to the latest version in the lock file.

Thanks, glad you were able to get it working without setuptools in the toml file!


.devcontainer/dockerfile line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Update to python 3.11

See other comment on python versions


machine/jobs/build_nmt_engine.py line 40 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

if args["build_options"] == "":
args["build_options"] = "{}"

If build options is passed in as an empty string, it'll throw a json error, which we are now relaying to the user. So I don't think this check is needed any longer.


machine/jobs/build_nmt_engine.py line 41 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

There should be a try-except here with an intelligent error about how the build_options were not parsable.

Done. Let me know if you'd like any tweaks to the error message.


machine/jobs/build_nmt_engine.py line 74 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would make the default an empty string or "{}". That way it is always the same type.

Done.


machine/jobs/nmt_engine_build_job.py line 25 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

In that case, Machine will generate empty source and target files. It is still possible for a calling application to provide source and target corpus files that are non-empty and the parallel corpus to be empty. This can occur if the source and target files do not have any aligned segments. That is why you should use parallel_corpus.count(include_empty=False) to determine if the parallel corpus is empty.

Done.

@johnml1135
Copy link
Collaborator

machine/jobs/nmt_engine_build_job.py line 30 at r3 (raw file):

        parallel_corpus = source_corpus.align_rows(target_corpus)

        if parallel_corpus.count(include_empty=False):

One more thing - there should probably be something put in the log for the else case like "no matching entries in the source and target corpus - skipping training".

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r3, all commit messages.
Reviewable status: 11 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @mshannon-sil)

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/jobs/nmt_engine_build_job.py line 30 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

One more thing - there should probably be something put in the log for the else case like "no matching entries in the source and target corpus - skipping training".

Done.

@johnml1135
Copy link
Collaborator

dockerfile line 2 at r1 (raw file):

Previously, mshannon-sil wrote…

Okay, I've left it at 3.8 for now, but I can upgrade it to 3.11 if we'd like.

It's fine at 3.8 for now. The TOML file allows it.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


dockerfile line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It's fine at 3.8 for now. The TOML file allows it.

Talking with John, it would be helpful to upgrade python to the latest version since Python 3.11 is much faster than Python 3.8. However, the way the devcontainer dockerfile is set up right now, any python above 3.9 doesn't work with it, and after wrangling with it for over an hour today I think it would be more efficient to work on upgrading it after I get done what I need to for MVP.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really need the exists_source_corpus and exists_target_corpus methods anymore, but I don't have a problem with leaving them in if you want.

Reviewed 2 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@johnml1135 johnml1135 merged commit 70ec8ed into main Oct 4, 2023
@ddaspit ddaspit deleted the config_passing_serval_#45 branch October 4, 2023 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

3 participants