Skip to content

Conversation

@muellerzr
Copy link
Contributor

@muellerzr muellerzr commented Oct 23, 2023

What does this PR do?

This PR brings back the default capability of pushing tensorboard logs as part of .push_to_hub() by modifying the glob exclusion filter to specifically ignore intermediate checkpoints.

You can see an example here where logs were successfully uploaded: https://huggingface.co/muellerzr/sequence_classification/tree/main

Was removed in #25095, likely because we didn't think about the fact that users would want to push their logs with tensorboard

Fixes # (issue)

fixes #26321

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @LysandreJik

@muellerzr muellerzr changed the title Muellerzr logs Save TB logs as part of push_to_hub Oct 23, 2023
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Oct 23, 2023

The documentation is not available anymore as the PR was closed or merged.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you link to the PR that removed it ? (and maybe add a test!)

@muellerzr
Copy link
Contributor Author

Thanks @ArthurZucker, done (and added a test!)

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for iterating!

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @muellerzr

cc @Wauplin if you have time to give it a quick look

@Wauplin
Copy link
Contributor

Wauplin commented Oct 24, 2023

Thanks for the ping @LysandreJik! I'm currently reviewing it. Most probably good but I want to make a few tests first with the blob patterns (I found them a bit un-intuitive TBH). Will update soon :)

@Wauplin
Copy link
Contributor

Wauplin commented Oct 24, 2023

@muellerzr Unfortunately, ignore_patterns=["_*", "[!runs]**/*"], will not work as expected. I've made some tests with fnmatch and TIL that even tough it claims to provide support for "Unix shell-style wildcards", I feel it's not the case.

Especially:

  • * matches everything no matter if there is a / in it. Both README.md and subfolder/README.md are matched.
  • "[!runs]*" matches anything that starts with a characters that is not a "r", a "u", a "n" or a "s". Meaning runs/foobar, nqwerty/foobar, sqwerty/foobar/..., etc. are all matching
  • => So "[!runs]**/*" matches any folder starting with r/u/n/s which is not the expected result

image

Note that the filename separator ('/' on Unix) is not special to this module.


I've tried to define a good set of allow_pattern + ignore_pattern that would achieve what you want to do but unfortunately I don't think it's possible. I feel bad about it because I introduce this fnmatch module thinking it was the good one to use but since it doesn't implement the same specs as the Unix pattern it kinda useless. I expected that the pattern would follow the same rules as described when doing man 7 glob.

EDIT: just realized that ignore_patterns = ["_*", "[!r]*/*", "?[!u]*/*", "??[!n]*/*", "???[!s]*/*"] would work but it feels so stupid...

@Wauplin
Copy link
Contributor

Wauplin commented Oct 24, 2023

(here is my test script if you want to play with it. At first I thought it would be better to include any *.tfevents.* file, no matter the folder hence the tests. But doesn't seem possible to do anyway

from fnmatch import fnmatch

paths = [
    "readme.md",
    "subfolder/readme.md",
    "folder/readme.md",
    "_private.json",
    "subfolder/_private.json",
    "foo.tfevents.bar",
    "subfolder/foo.tfevents.bar",
    "runs/subfolder/foo.tfevents.bar",
    "not/subfolder/foo.tfevents.bar",
]

# allow_patterns = ["*.tfevents.*", "*[!\]*"]
allow_patterns = ["*"]

ignore_patterns = ["_*", "[!r]*/*", "?[!u]*/*", "??[!n]*/*", "???[!s]*/*"]

# pattern = '**/*[!tfevents]*'
# pattern = '[!runs]**/*'

def is_included(path):
    if any(fnmatch(path, r) for r in ignore_patterns):
        return False
    if any(fnmatch(path, r) for r in allow_patterns):
        return True
    return False

for path in paths:
    print(f"{path:<40}", is_included(path))

)

@Wauplin
Copy link
Contributor

Wauplin commented Oct 24, 2023

Actually looking back at the implementation before #25095, do we really need to exclude every folder -so the current ignore_patterns="**/*"- ? I don't see anywhere where Repositoryused to exclude some folders. If it was in a.gitignore`, I would be keen to see it.

I already thought about the fact that we should respect the .gitignore file (if any) when using upload_folder. It is not very straightforward but even a subset of the .gitignore's spec would be good. (only thinking out loud here, let's fix the PR independently of that)

@julien-c
Copy link
Member

we should respect the .gitignore file (if any) when using upload_folder

Or maybe server-side support (in case the .gitignore is uploaded first, of course). I don't know if it's the case server-side already or not, to be honest.

@muellerzr muellerzr requested a review from Wauplin October 26, 2023 13:23
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for making the changes @muellerzr 👍

@muellerzr muellerzr merged commit 34a6406 into main Oct 26, 2023
@muellerzr muellerzr deleted the muellerzr-logs branch October 26, 2023 16:13
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* Support runs/

* Upload runs folder as part of push to hub

* Add a test

* Add to test deps

* Update with proposed solution from Slack

* Ensure that repo gets deleted in tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TensorBoard integration on huggingface

7 participants