Use UTF8 encoding on Tar string fields #75902

jozkee · 2022-09-20T15:57:26Z

ghost · 2022-09-20T15:57:58Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	Jozkee
Assignees:	-
Labels:	`area-System.IO`
Milestone:	-

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

stephentoub · 2022-09-20T16:20:31Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

+
+            int checksum = Checksum(destination);
+
+            if (utf16NameTruncatedLength < name.Length)


What's the expected behavior if truncatedName as UTF8 only takes up, say, 98 of the 100 bytes in the Name field and then we spill over to Prefix for the rest? How do tools know the extra two bytes in Name aren't actually part of the name?

the thing is, the prefix logic is wrong, I filed #75360 for it. I don't want to fix it here because of time constraint.

It's not clear to me from that issue: is the problem with how we read/interpret name and prefix, or is the issue with how we write out the name and prefix, or both?

the issue is how we write it.
We currently do this:

Ustar and PAX save in the name byte array only the bytes that fit, and the rest of that string is saved in the prefix field.

Which is actually a suffix.

Instead we should do this:

If the pathname is too long to fit in the 100 bytes provided by the standard format, it can be split at any / character with the first portion going into the prefix field. If the prefix field is not empty, the reader will prepend the prefix value and a / character to the regular name field to obtain the full pathname. The standard does not require a trailing / character on directory names, though most implementations still include this for compatibility reasons.

Then if the code that's there is 100% wrong and the code in the PR it's being updated to is 100% wrong, why are we changing that code? Seems like we should either fix it in this PR, or as part of changing it in this PR, this should fail in a more reliable way rather than outputting corrupt data. Do we plan to fix this issue for 7.0 as well?

IMO, we should fix it for 7.0 as well. Please see #75902 (comment).

src/libraries/System.Formats.Tar/tests/TarTestsBase.cs

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

danmoseley · 2022-09-21T15:22:09Z

Could you describe the interop testing you're planning to do for this? Trying with various tar implementations. Might be worth a table.

jozkee · 2022-09-21T15:59:16Z

@danmoseley I wasn't planning on adding interop tests to System.Formats.Tar.Tests. I did play a bit with GNU Tar with all the formats we support, to see how they handled non-ascii characters and all formats are able to write non-ascii correctly.
I also tested groupname and username and checked with linux commands useradd and groupadd and both commands were able to handle non-ascii.

jozkee · 2022-09-22T15:33:17Z

Last commit addresses the problem with ustar prefix (#75360) and partially addresses the truncation issue (#75921) as the name won't be truncated on formats that do not suppoort unlimited size names, I said "partially" because other fields are still being truncated (linkname, uname, gname). I was hoping that I could present an sketch of what fixing those issues would look like and extend the fix to the other fields once you agree with it.

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

stephentoub · 2022-09-22T15:46:00Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

+        private int WriteUstarName(Span<byte> buffer)
        {
-            int checksum = WriteName(buffer);
+            const int MaxPathname = FieldLengths.Prefix + 1 + FieldLengths.Name;


What is the +1 for? The separator between the values in the two fields? Worth a comment.

For the separator between prefix and name, it can be neglected as the fields will be rejoined with one on read.

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

jozkee · 2022-09-22T17:33:06Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

+
+            if (_format is TarEntryFormat.V7 && name.Length != utf16NameTruncatedLength)
+            {
+                throw new ArgumentException(SR.TarEntryNameExceedsMaxLength, ArgNameEntry);


The semantic here is that TarWriter.WriteEntry[Async] throws if entry's Name exceeds max length. It could be better to throw on TarEntry.set_Name.
Let me know if you want me to change this behavior.

Do we know all the constraints at that point? If so, I agree that validating the name when it's being set is better, though in practice I don't know what consumption patterns look like and whether it'll actually make a meaningful difference, e.g. if you're just using our higher-level TarFile APIs, the exception's going to come from the same place regardless of where we throw it.

Do we know all the constraints at that point?

Yes, the only constraints are the entry format and the encoded size of the name. The only downside I see is that it will make this PR a bigger change.

Regarding the consumption patterns, I assume most people using TarWriter directly want to build a tar archive feeding it with a folder enumeration, and write it down to anything other than a FileStream.

jozkee · 2022-09-26T19:38:31Z

Updated PR to also fix the other two related issues, field truncation and ustar prefix logic (see description).

jozkee · 2022-09-26T19:48:45Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

+            scoped ReadOnlySpan<byte> name;
+            scoped ReadOnlySpan<byte> prefix;
+
+            if (lastIdx < 1) // splitting at the root is not allowed.


// splitting at the root is not allowed.

I copied this behavior from GNU Tar, it does not store one single separtor in the prefix, I assume we should do the same.

No big deal to go the other way but "absolute path" is already very odd scenario for tar and I don't have a reason to not do the same.

On top of that, I think we should instruct users to be aware about/avoid using absolute paths with Tar APIs.

carlossanlop

Thanks @jozkee for your help getting the 3 issues fixed at once. I left some suggestions and questions.

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

carlossanlop · 2022-09-26T23:51:02Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

+        private static int GetUtf8TextLength(ReadOnlySpan<char> text)
+            => Encoding.UTF8.GetByteCount(text);
+


I don't think this method is needed since it's only doing one thing. We can just call Encoding.UTF8.GetByteCount directly.

Suggested change

private static int GetUtf8TextLength(ReadOnlySpan<char> text)

=> Encoding.UTF8.GetByteCount(text);

I just added it as it is a recurrent call and it is easier to read IMO.

I leave it to you as a nit. You can either remove them as I suggested, or make sure all the other call sites for Encoding.UTF8.GetByteCount in this same file call the method you're adding.

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntry.Tests.cs

src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntry.Entry.Roundtrip.Tests.cs

carlossanlop

LGTM. Thank you for your help, @jozkee. I left a few extra nits, consider them optional and can be addressed later.

src/libraries/System.Formats.Tar/tests/TarWriter/TarWriter.WriteEntryAsync.Tests.cs

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

danmoseley · 2022-09-28T02:27:48Z

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs

-                Span<byte> remaining = stackalloc byte[prefixBytesLength];
-                int encoded = Encoding.ASCII.GetBytes(_name.AsSpan(FieldLengths.Name, prefixBytesLength), remaining);
-                Debug.Assert(encoded == remaining.Length);
+                throw new ArgumentException(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry);


Suggested change

throw new ArgumentException(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry);

throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, nameof(TarEntry.Name)), ArgNameEntry);

insidious bug!

if we had something like Debug.Assert(!Regex.IsMatch("{\d}", message) in the Exception base class constructor, would it catch bugs like this?

(It wouldn't catch something like throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry)); but that seems less likely)

Cc @stephentoub might it be worth it?

You could experiment and see. It'd only catch things in builds where corelib is debug/checked, and if there's any case in all of dotnet/runtime where we validly construct an exception with such a message, it'd need to be reverted. I'd be surprised if there isn't at least one.

(It wouldn't catch something like throw new ArgumentException(SR.Format(SR.TarEntryFieldExceedsMaxLength, ArgNameEntry)); but that seems less likely)

You could compare the matches of {\d} against args.Length in SR.Format to also catch this case and the prior.

carlossanlop · 2022-09-28T18:53:02Z

/backport to release/7.0

github-actions · 2022-09-28T18:53:14Z

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3146104764

Use UTF8 encoding on Tar string fields

7209fe1

jozkee requested review from adamsitnik, carlossanlop and stephentoub September 20, 2022 15:57

ghost added the area-System.IO label Sep 20, 2022

ghost assigned jozkee Sep 20, 2022

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs Show resolved Hide resolved

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/tests/TarTestsBase.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs Outdated Show resolved Hide resolved

stephentoub reviewed Sep 20, 2022

View reviewed changes

src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Write.cs Show resolved Hide resolved

jozkee added 3 commits September 20, 2022 11:52

Slice destination on Checksum

5a7c567

Use Encoding.GetByteCount as fast path

88f1b50

Use escape sequences on hardcoded UTF8 characters

950313c

jozkee requested a review from stephentoub September 20, 2022 17:13

Fix ustar prefix logic and throw if name would be truncated

ceab870

stephentoub reviewed Sep 22, 2022

View reviewed changes