Skip to content

Conversation

@jozkee
Copy link
Member

@jozkee jozkee commented Sep 20, 2022

Fixes #75482
Fixes #75360
Fixes #75921

@ghost
Copy link

ghost commented Sep 20, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #75482.

Author: Jozkee
Assignees: -
Labels:

area-System.IO

Milestone: -


int checksum = Checksum(destination);

if (utf16NameTruncatedLength < name.Length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected behavior if truncatedName as UTF8 only takes up, say, 98 of the 100 bytes in the Name field and then we spill over to Prefix for the rest? How do tools know the extra two bytes in Name aren't actually part of the name?

Copy link
Member Author

@jozkee jozkee Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thing is, the prefix logic is wrong, I filed #75360 for it. I don't want to fix it here because of time constraint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me from that issue: is the problem with how we read/interpret name and prefix, or is the issue with how we write out the name and prefix, or both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue is how we write it.
We currently do this:

Ustar and PAX save in the name byte array only the bytes that fit, and the rest of that string is saved in the prefix field.

Which is actually a suffix.

Instead we should do this:

If the pathname is too long to fit in the 100 bytes provided by the standard format, it can be split at any / character with the first portion going into the prefix field. If the prefix field is not empty, the reader will prepend the prefix value and a / character to the regular name field to obtain the full pathname. The standard does not require a trailing / character on directory names, though most implementations still include this for compatibility reasons.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then if the code that's there is 100% wrong and the code in the PR it's being updated to is 100% wrong, why are we changing that code? Seems like we should either fix it in this PR, or as part of changing it in this PR, this should fail in a more reliable way rather than outputting corrupt data. Do we plan to fix this issue for 7.0 as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we should fix it for 7.0 as well. Please see #75902 (comment).

@jozkee jozkee requested a review from stephentoub September 20, 2022 17:13
@danmoseley
Copy link
Member

Could you describe the interop testing you're planning to do for this? Trying with various tar implementations. Might be worth a table.

@jozkee
Copy link
Member Author

jozkee commented Sep 21, 2022

@danmoseley I wasn't planning on adding interop tests to System.Formats.Tar.Tests. I did play a bit with GNU Tar with all the formats we support, to see how they handled non-ascii characters and all formats are able to write non-ascii correctly.
I also tested groupname and username and checked with linux commands useradd and groupadd and both commands were able to handle non-ascii.

@jozkee
Copy link
Member Author

jozkee commented Sep 22, 2022

Last commit addresses the problem with ustar prefix (#75360) and partially addresses the truncation issue (#75921) as the name won't be truncated on formats that do not suppoort unlimited size names, I said "partially" because other fields are still being truncated (linkname, uname, gname). I was hoping that I could present an sketch of what fixing those issues would look like and extend the fix to the other fields once you agree with it.

private int WriteUstarName(Span<byte> buffer)
{
int checksum = WriteName(buffer);
const int MaxPathname = FieldLengths.Prefix + 1 + FieldLengths.Name;
Copy link
Member

@stephentoub stephentoub Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the +1 for? The separator between the values in the two fields? Worth a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the separator between prefix and name, it can be neglected as the fields will be rejoined with one on read.


if (_format is TarEntryFormat.V7 && name.Length != utf16NameTruncatedLength)
{
throw new ArgumentException(SR.TarEntryNameExceedsMaxLength, ArgNameEntry);
Copy link
Member Author

@jozkee jozkee Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantic here is that TarWriter.WriteEntry[Async] throws if entry's Name exceeds max length. It could be better to throw on TarEntry.set_Name.
Let me know if you want me to change this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know all the constraints at that point? If so, I agree that validating the name when it's being set is better, though in practice I don't know what consumption patterns look like and whether it'll actually make a meaningful difference, e.g. if you're just using our higher-level TarFile APIs, the exception's going to come from the same place regardless of where we throw it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know all the constraints at that point?

Yes, the only constraints are the entry format and the encoded size of the name. The only downside I see is that it will make this PR a bigger change.

Regarding the consumption patterns, I assume most people using TarWriter directly want to build a tar archive feeding it with a folder enumeration, and write it down to anything other than a FileStream.

@jozkee jozkee requested a review from stephentoub September 26, 2022 19:36
@jozkee
Copy link
Member Author

jozkee commented Sep 26, 2022

Updated PR to also fix the other two related issues, field truncation and ustar prefix logic (see description).

scoped ReadOnlySpan<byte> name;
scoped ReadOnlySpan<byte> prefix;

if (lastIdx < 1) // splitting at the root is not allowed.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// splitting at the root is not allowed.

I copied this behavior from GNU Tar, it does not store one single separtor in the prefix, I assume we should do the same.

No big deal to go the other way but "absolute path" is already very odd scenario for tar and I don't have a reason to not do the same.

On top of that, I think we should instruct users to be aware about/avoid using absolute paths with Tar APIs.

Copy link
Contributor

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jozkee for your help getting the 3 issues fixed at once. I left some suggestions and questions.

Comment on lines +921 to +923
private static int GetUtf8TextLength(ReadOnlySpan<char> text)
=> Encoding.UTF8.GetByteCount(text);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this method is needed since it's only doing one thing. We can just call Encoding.UTF8.GetByteCount directly.

Suggested change
private static int GetUtf8TextLength(ReadOnlySpan<char> text)
=> Encoding.UTF8.GetByteCount(text);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added it as it is a recurrent call and it is easier to read IMO.

Copy link
Contributor

@carlossanlop carlossanlop Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave it to you as a nit. You can either remove them as I suggested, or make sure all the other call sites for Encoding.UTF8.GetByteCount in this same file call the method you're adding.

@jozkee jozkee requested a review from carlossanlop September 27, 2022 04:31
Copy link
Contributor

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for your help, @jozkee. I left a few extra nits, consider them optional and can be addressed later.

Span<byte> remaining = stackalloc byte[prefixBytesLength];
int encoded = Encoding.ASCII.GetBytes(_name.AsSpan(FieldLengths.Name, prefixBytesLength), remaining);
Debug.Assert(encoded == remaining.Length);
throw new ArgumentException(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw new ArgumentException(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry);
throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, nameof(TarEntry.Name)), ArgNameEntry);

insidious bug!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we had something like Debug.Assert(!Regex.IsMatch("{\d}", message) in the Exception base class constructor, would it catch bugs like this?

(It wouldn't catch something like throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry)); but that seems less likely)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cc @stephentoub might it be worth it?

Copy link
Member

@stephentoub stephentoub Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could experiment and see. It'd only catch things in builds where corelib is debug/checked, and if there's any case in all of dotnet/runtime where we validly construct an exception with such a message, it'd need to be reverted. I'd be surprised if there isn't at least one.

Copy link
Member Author

@jozkee jozkee Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It wouldn't catch something like throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry)); but that seems less likely)

You could compare the matches of {\d} against args.Length in SR.Format to also catch this case and the prior.

@jozkee jozkee merged commit 620bc76 into dotnet:main Sep 28, 2022
@jozkee jozkee deleted the tar_utf8names branch September 28, 2022 18:39
@carlossanlop
Copy link
Contributor

/backport to release/7.0

@github-actions
Copy link
Contributor

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3146104764

@ghost ghost locked as resolved and limited conversation to collaborators Oct 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

5 participants