-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Use UTF8 encoding on Tar string fields #75902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-io |
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
|
|
||
| int checksum = Checksum(destination); | ||
|
|
||
| if (utf16NameTruncatedLength < name.Length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected behavior if truncatedName as UTF8 only takes up, say, 98 of the 100 bytes in the Name field and then we spill over to Prefix for the rest? How do tools know the extra two bytes in Name aren't actually part of the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the thing is, the prefix logic is wrong, I filed #75360 for it. I don't want to fix it here because of time constraint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me from that issue: is the problem with how we read/interpret name and prefix, or is the issue with how we write out the name and prefix, or both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the issue is how we write it.
We currently do this:
Ustar and PAX save in the name byte array only the bytes that fit, and the rest of that string is saved in the prefix field.
Which is actually a suffix.
Instead we should do this:
If the pathname is too long to fit in the 100 bytes provided by the standard format, it can be split at any / character with the first portion going into the prefix field. If the prefix field is not empty, the reader will prepend the prefix value and a / character to the regular name field to obtain the full pathname. The standard does not require a trailing / character on directory names, though most implementations still include this for compatibility reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then if the code that's there is 100% wrong and the code in the PR it's being updated to is 100% wrong, why are we changing that code? Seems like we should either fix it in this PR, or as part of changing it in this PR, this should fail in a more reliable way rather than outputting corrupt data. Do we plan to fix this issue for 7.0 as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should fix it for 7.0 as well. Please see #75902 (comment).
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
|
Could you describe the interop testing you're planning to do for this? Trying with various tar implementations. Might be worth a table. |
|
@danmoseley I wasn't planning on adding interop tests to |
|
Last commit addresses the problem with ustar prefix (#75360) and partially addresses the truncation issue (#75921) as the name won't be truncated on formats that do not suppoort unlimited size names, I said "partially" because other fields are still being truncated (linkname, uname, gname). I was hoping that I could present an sketch of what fixing those issues would look like and extend the fix to the other fields once you agree with it. |
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
| private int WriteUstarName(Span<byte> buffer) | ||
| { | ||
| int checksum = WriteName(buffer); | ||
| const int MaxPathname = FieldLengths.Prefix + 1 + FieldLengths.Name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the +1 for? The separator between the values in the two fields? Worth a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the separator between prefix and name, it can be neglected as the fields will be rejoined with one on read.
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
|
|
||
| if (_format is TarEntryFormat.V7 && name.Length != utf16NameTruncatedLength) | ||
| { | ||
| throw new ArgumentException(SR.TarEntryNameExceedsMaxLength, ArgNameEntry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantic here is that TarWriter.WriteEntry[Async] throws if entry's Name exceeds max length. It could be better to throw on TarEntry.set_Name.
Let me know if you want me to change this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know all the constraints at that point? If so, I agree that validating the name when it's being set is better, though in practice I don't know what consumption patterns look like and whether it'll actually make a meaningful difference, e.g. if you're just using our higher-level TarFile APIs, the exception's going to come from the same place regardless of where we throw it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know all the constraints at that point?
Yes, the only constraints are the entry format and the encoded size of the name. The only downside I see is that it will make this PR a bigger change.
Regarding the consumption patterns, I assume most people using TarWriter directly want to build a tar archive feeding it with a folder enumeration, and write it down to anything other than a FileStream.
|
Updated PR to also fix the other two related issues, field truncation and ustar prefix logic (see description). |
| scoped ReadOnlySpan<byte> name; | ||
| scoped ReadOnlySpan<byte> prefix; | ||
|
|
||
| if (lastIdx < 1) // splitting at the root is not allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// splitting at the root is not allowed.
I copied this behavior from GNU Tar, it does not store one single separtor in the prefix, I assume we should do the same.
No big deal to go the other way but "absolute path" is already very odd scenario for tar and I don't have a reason to not do the same.
On top of that, I think we should instruct users to be aware about/avoid using absolute paths with Tar APIs.
carlossanlop
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jozkee for your help getting the 3 issues fixed at once. I left some suggestions and questions.
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
| private static int GetUtf8TextLength(ReadOnlySpan<char> text) | ||
| => Encoding.UTF8.GetByteCount(text); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this method is needed since it's only doing one thing. We can just call Encoding.UTF8.GetByteCount directly.
| private static int GetUtf8TextLength(ReadOnlySpan<char> text) | |
| => Encoding.UTF8.GetByteCount(text); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added it as it is a recurrent call and it is easier to read IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave it to you as a nit. You can either remove them as I suggested, or make sure all the other call sites for Encoding.UTF8.GetByteCount in this same file call the method you're adding.
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntry.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntry.Entry.Roundtrip.Tests.cs
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntry.Entry.Roundtrip.Tests.cs
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntry.Entry.Roundtrip.Tests.cs
Show resolved
Hide resolved
carlossanlop
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for your help, @jozkee. I left a few extra nits, consider them optional and can be addressed later.
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntryAsync.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntryAsync.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntryAsync.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntryAsync.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntryAsync.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntryAsync.Tests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs
Outdated
Show resolved
Hide resolved
| Span<byte> remaining = stackalloc byte[prefixBytesLength]; | ||
| int encoded = Encoding.ASCII.GetBytes(_name.AsSpan(FieldLengths.Name, prefixBytesLength), remaining); | ||
| Debug.Assert(encoded == remaining.Length); | ||
| throw new ArgumentException(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| throw new ArgumentException(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry); | |
| throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, nameof(TarEntry.Name)), ArgNameEntry); |
insidious bug!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we had something like Debug.Assert(!Regex.IsMatch("{\d}", message) in the Exception base class constructor, would it catch bugs like this?
(It wouldn't catch something like throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry)); but that seems less likely)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cc @stephentoub might it be worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could experiment and see. It'd only catch things in builds where corelib is debug/checked, and if there's any case in all of dotnet/runtime where we validly construct an exception with such a message, it'd need to be reverted. I'd be surprised if there isn't at least one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It wouldn't catch something like throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry)); but that seems less likely)
You could compare the matches of {\d} against args.Length in SR.Format to also catch this case and the prior.
|
/backport to release/7.0 |
|
Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3146104764 |
Fixes #75482
Fixes #75360
Fixes #75921