Skip to content

Conversation

haixiw
Copy link
Contributor

@haixiw haixiw commented Jun 15, 2023

Issue #, if available:
https://answers.amazon.com/posts/362803

Previously if customer set the input files channel to be s3://bucket/sagemaker/train/ (file mode)
it will only load the files directly under this path or a random sub dir based on workflow here

https://github.com/aws/sagemaker-xgboost-container/blob/master/src/sagemaker_xgboost_container/data_utils.py#L557
https://github.com/aws/sagemaker-xgboost-container/blob/master/src/sagemaker_xgboost_container/data_utils.py#L638-L647

With this change, the behavior of loading files will be consistent with pipe mode, which is that all the sub folders under the root dir customer specified will be traversed and all the files found there will be loaded.

Description of changes:

  1. Add a helper function to recursively creating symlinks from a folder to another folder
  2. Using the above helper function to create symlinks from all sub folders under root dir user specified to a temp folder. (XGB could load all files under the same dir)
  3. In order to prevent name collision as all the files are linked to the same dir, I added a suffix for the new filename in temp dir with the hash of the source file's path.

Testing:
unit tests,
Tested with Notebook : https://haixiw-test.notebook.us-west-2.sagemaker.aws/notebooks/XGBoost_abalone/xgboost_abalone.ipynb

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@haixiw haixiw requested review from a team, mabunday and malav-shastri June 15, 2023 00:14
@mabunday
Copy link
Contributor

Hmm my only concern here is that it changes behavior for existing customers. Can you also take a look at the PR that added the function in question and see if there's anything we need to account for here? https://github.com/aws/sagemaker-xgboost-container/pull/158/files

@malav-shastri
Copy link
Member

Thanks for this PR, apart from above just curious to know, as discussed in our sync were we actually able to find similar logic in 1P closed source containers(AIAlgorithmsSDK) where this is already being supported? And whether this fix is based on that to maintain consistency(although that's not needed but just to get the understanding about it)

@haixiw
Copy link
Contributor Author

haixiw commented Jun 16, 2023

Hmm my only concern here is that it changes behavior for existing customers. Can you also take a look at the PR that added the function in question and see if there's anything we need to account for here? https://github.com/aws/sagemaker-xgboost-container/pull/158/files

Thanks for the headsup. I think theoretically this change won't affect the previous workflow with cross validation after revisited https://github.com/aws/sagemaker-xgboost-container/pull/158/files. There's a local integ test for cross validation here https://github.com/aws/sagemaker-xgboost-container/blob/master/test/integration/local/test_kfold.py, also I will sync with AP oncall to see if they could perform their integ tests on some test images.

@haixiw
Copy link
Contributor Author

haixiw commented Jun 16, 2023

Thanks for this PR, apart from above just curious to know, as discussed in our sync were we actually able to find similar logic in 1P closed source containers(AIAlgorithmsSDK) where this is already being supported? And whether this fix is based on that to maintain consistency(although that's not needed but just to get the understanding about it)

Thanks for the comment! I haven't looked into that closely but let's keep this discussion through slack as we don't want expose much SDK stuff here.


def _make_symlinks_for_files_under_a_folder(dest_path: str, data_path: str):
if (not os.path.exists(dest_path)) or (not os.path.exists(data_path)):
raise exc.AlgorithmError("Unable to create symlinks as {data_path} or {dest_path} doesn't exist ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this meant to be an f-string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, yeah, I will update

if (not os.path.exists(dest_path)) or (not os.path.exists(data_path)):
raise exc.AlgorithmError("Unable to create symlinks as {data_path} or {dest_path} doesn't exist ")

logging.info("Making smlinks from folder {} to folder {}".format(data_path, dest_path))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can use f-strings here instead of format for better readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, Thanks

if item.is_file():
_make_symlink(item.path, dest_path, item.name)
elif item.is_dir():
_make_symlinks_for_files_under_a_folder(dest_path, item.path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about recursion depth here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emmm, I was thinking about this too. I will add a depth limit of 5 here.

Comment on lines 568 to 572
elif os.path.isdir(data_path):
# traverse all sub-dirs to load all training data
_make_symlinks_for_files_under_a_folder(files_path, data_path)
elif os.path.isfile(data_path):
files_path = data_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just call _make_symlinks_for_files_under_a_folder here? Isn't it able to handle singular files:?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it 's able to do that..Let me refactor the info message in next commit to prevent confusion.

os.symlink(path, base_name)
def _make_symlink(path, source_path, name):
base_name = os.path.join(source_path, name)
file_name = base_name + str(hash(path))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in case of hash collisions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error will be thrown out saying there is an existing symlink between A and B. The reason for hash(path) is that no matter how complicated the file structure is, the path for a single file is normally unique and it won't be too long of a string when we limit the depth for recursion.

@haixiw haixiw marked this pull request as ready for review June 20, 2023 05:10
@haixiw haixiw requested a review from mabunday June 20, 2023 16:15
return files_path


def _make_symlinks_from_a_folder(dest_path: str, data_path: str, depth: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we made this an iterative function we wouldn't need to worry about this depth at all (though it would probably still make sense to have some cap). Feel free to leave as is, but maybe 5 is a more reasonable depth.

@malav-shastri
Copy link
Member

Do we need to fix 1.3 and 1.5 as well?

@haixiw
Copy link
Contributor Author

haixiw commented Jun 22, 2023

Do we need to fix 1.3 and 1.5 as well?

Yes. This seems a general issue for all versions.

@haixiw haixiw merged commit 393a24e into master Jun 22, 2023
@haixiw haixiw deleted the fix_data_load branch June 22, 2023 20:22
haixiw added a commit that referenced this pull request Jul 27, 2023
* Fix: load training data from all sub dirs

* fix flake8 formatting

* fix local integ test

* resolve some comments
haixiw added a commit that referenced this pull request Jul 27, 2023
* Fix: load training data from all sub dirs

* fix flake8 formatting

* fix local integ test

* resolve some comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants