Conversation

@LeontiBrechko
Collaborator

Description

Move spark transfer metadata to a subdirectory of the target export S3 location
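
S3 has no native move operation, so a move like this is normally a copy followed by a delete. A minimal sketch under stated assumptions: `move_keys_to_subfolder` and its parameters are hypothetical names, not the code in this PR, and the `meta` subfolder name is taken from the `/meta` folder mentioned in the review thread. `s3_client` is assumed to be a boto3 S3 client.

```python
def move_keys_to_subfolder(s3_client, bucket, prefix, keys, subfolder='meta'):
    """Copy each key into <prefix>/<subfolder>/ and delete the original.

    Hypothetical helper for illustration; not the implementation in this PR.
    """
    # Make sure a non-empty prefix ends with exactly one trailing slash.
    base = prefix
    if base and not base.endswith('/'):
        base += '/'

    moved = []
    for key in keys:
        file_name = key.rsplit('/', 1)[-1]
        target_key = f'{base}{subfolder}/{file_name}'
        # S3 "move" = copy to the new key, then delete the old key.
        s3_client.copy_object(
            Bucket=bucket,
            Key=target_key,
            CopySource={'Bucket': bucket, 'Key': key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved.append(target_key)
    return moved
```

Passing the client in as an argument keeps the function easy to exercise with a stub in unit tests. `copy_object` handles objects up to 5 GB; larger objects would need a multipart copy.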

Testing

print(f'Identified bucket: {bucket}, prefix: {prefix}')

# List all files in the s3_path directory
response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
Collaborator

Will this return folders as well? If only files are returned, then we are good.

Collaborator Author

@LeontiBrechko Apr 4, 2024

Only files. The same holds when listing objects in our Java repo using Amplitude's S3 wrapper.

The example job mentioned in the description has logs of what was discovered (note that the /meta folder already exists when this method runs and is not listed).
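
For reference, a sketch of how `list_objects_v2` separates the two when `Delimiter='/'` is set: keys directly under the prefix come back in `Contents`, while anything deeper is rolled up into `CommonPrefixes`. The function name `list_files_and_folders` is hypothetical, and `s3_client` is assumed to be a boto3 S3 client; pagination is included because a single call returns at most 1,000 keys.

```python
def list_files_and_folders(s3_client, bucket, prefix):
    """Return (files, folders) found directly under prefix.

    Hypothetical helper for illustration; not the code under review.
    """
    files, folders = [], []
    kwargs = {'Bucket': bucket, 'Prefix': prefix, 'Delimiter': '/'}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        # Keys with no further '/' after the prefix: the "files".
        files.extend(obj['Key'] for obj in response.get('Contents', []))
        # Deeper keys are grouped by their next path segment: the "folders".
        folders.extend(cp['Prefix'] for cp in response.get('CommonPrefixes', []))
        if not response.get('IsTruncated'):
            break
        kwargs['ContinuationToken'] = response['NextContinuationToken']
    return files, folders
```

This assumes `prefix` ends with `/`; without the trailing slash, the directory itself is reported as a single common prefix. A zero-byte placeholder object whose key equals the prefix, if one exists, would also show up in `Contents`.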

if '://' not in s3_uri_with_spark_metadata:
    raise ValueError(f'Invalid s3 URI: {s3_uri_with_spark_metadata}. Expected to contain "://".')
bucket, prefix = s3_uri_with_spark_metadata.split('://')[1].split('/', 1)
bucket = replace_double_slashes_with_single_slash(bucket)
Collaborator

Curious: where are the double slashes coming from?

Collaborator Author

For the bucket, there shouldn't be any.

It's just sanitization in case an unnormalized S3 URI is provided (e.g. s3:////bucket////prefix////).
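
One way to make that sanitization cover the whole URI is to collapse repeated slashes before splitting out the bucket. Applied to the split shown in the diff, the example URI would leave an empty bucket segment, so the order matters. A minimal sketch: `split_s3_uri` is a hypothetical name, and this `replace_double_slashes_with_single_slash` is an assumed re-implementation, since the body of the PR's helper is not shown in this thread.

```python
import re
from typing import Tuple


def replace_double_slashes_with_single_slash(value: str) -> str:
    # Assumed behavior: collapse any run of two or more slashes into one.
    return re.sub(r'/{2,}', '/', value)


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split an S3 URI into (bucket, prefix), tolerating repeated slashes.

    Hypothetical helper for illustration; not the code in this PR.
    """
    if '://' not in s3_uri:
        raise ValueError(f'Invalid s3 URI: {s3_uri}. Expected to contain "://".')
    # Drop the scheme, normalize slashes, then strip any leading slash
    # so the first path segment is the bucket.
    _, rest = s3_uri.split('://', 1)
    rest = replace_double_slashes_with_single_slash(rest).lstrip('/')
    bucket, _, prefix = rest.partition('/')
    if not bucket:
        raise ValueError(f'Invalid s3 URI: {s3_uri}. Bucket name is empty.')
    return bucket, prefix
```

Normalizing first means both the bucket and the prefix come out clean from a single pass, and an empty bucket is rejected explicitly instead of being passed on to S3.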

expected_output = '/path/to/file/with/double/slashes/end/'
self.assertEqual(expected_output, replace_double_slashes_with_single_slash(input_string))

def test_move_spark_metadata_to_separate_s3_folder(self):
Collaborator

Thanks for adding tests!

Collaborator

@fzqgriffin left a comment

LGTM. Let's hold off on merging until AMP-96980 is approved, so the other PR can stay focused on the event mutation changes.

3 participants