Skip to content

Headers are unnecessary encoded to ByteString if there is no Transfer-Encoding: chunked header #42579

@AlttiRi

Description

@AlttiRi

Version

v17.5.0

Platform

Windows 10

Subsystem

http

What steps will reproduce the bug?

Run this http server:

import http from "http";

const name = `Rock & roll 音楽 («🎵🎶»).txt`; // Also is used as the response body

const defaultCdHeader = getContentDisposition(name);
// console.log(defaultCdHeader === Buffer.from(`inline; filename="${name}"`).toString("binary")); // true

const host = "localhost";
const port = 8000;
const origin = `http://${host}:${port}`;
const server = http.createServer(requestListener);
server.listen(port, host, () => {
    console.log("Server is running on:     " + origin + "       (open the web page)");
    console.log("Downloads the file:       " + origin + "/?dl=1");
    console.log("Remove Transfer-Encoding: " + origin + "/?te=0" + " (open the web page)");
    console.log("Downloads w/o T-E header: " + origin + "/?dl=1&te=0");
});

function requestListener(req, res) {
    let cdHeader = defaultCdHeader;
    const {dl, te} = Object.fromEntries(new URL(origin + req.url).searchParams.entries());
    if (dl === "1") {
        cdHeader = getContentDisposition(name, {type: "attachment"});
    }

    res.setHeader("Content-Disposition", cdHeader);
    // The header must be a `ByteString`. For example, I can't do that:
    // res.setHeader("Header-X", name); // TypeError [ERR_INVALID_CHAR]: Invalid character in header content ["Header-X"]

    if (te === "0") { // Note: "Transfer-Encoding: chunked" is set by default.
        res.removeHeader("Transfer-Encoding"); // Any other TE header's value has the same effect as the header removing
        const byteCount = new TextEncoder().encode(name).length;
        res.setHeader("Content-Length", byteCount);
    }
    res.setHeader("Content-Type", "text/html; charset=utf-8");
    res.writeHead(200);
    res.end(name);
}

// Note: In case of this issue you can ignore the follow function code. It just returns C-D header as `ByteString`.
/**
 * Simple implementation for getting "Content-Disposition" header from filename
 * @example
 * // By default, it produces the same result as it (with replacing all double-quotes by underscore):
 * Buffer.from(`inline; filename="${name}"`).toString("binary")
 *
 * @param {string} name
 * @param {Object} opts
 * @param {"inline"|"attachment"} [opts.type="inline"]
 * @param {Boolean} [opts.encoded=false]
 * @param {Boolean} [opts.filename=true]
 * @param {String} [opts.replacer="_"]
 * @return {string}
 */
function getContentDisposition(name, opts = {}) {
    const {
        type, encoded, filename, replacer
    } = Object.assign({type: "inline", encoded: false, filename: true, replacer: "_"}, opts);
    const fixName = (name) => name.replaceAll(`"`, replacer); // The most trivial fix, since it uses the quoted filename. Prevents from the incorrect header parsing. For, example: `";"` (3 chars in ` quotes).
    const encodeMap = new Map([["'", "%27"], ["(", "%28"], [")", "%29"], ["*", "%30"]]); // Required to escape it for old browsers (For example, a Chromium of 2013)
    const getEncodedName = (name) => encodeURIComponent(name).replaceAll(/['()*]/g, (val) => encodeMap.get(val));
    const encodedStr = encoded ? `filename*=UTF-8''${getEncodedName(name)}` : "";
    const filenameStr = filename ? `filename="${fixName(name)}"` : "";
    const header = [type, filenameStr, encodedStr].filter(x => x).join("; ");
    return Buffer.from(header).toString("binary");
}
  1. Click on the link http://localhost:8000/?dl=1 to download a file with Rock & roll 音楽 («🎵🎶»).txt name.
  2. Click on the link http://localhost:8000/?dl=1&te=0 to download a file with Rock & roll 音楽 («🎵🎶»).txt name (in this case "Transfer-Encoding" header will be removed)

How often does it reproduce? Is there a required condition?

Always.

What is the expected behavior?

Both files are downloaded with Rock & roll 音楽 («🎵🎶»).txt name.

What do you see instead?

The first file has the correct name — Rock & roll 音楽 («🎵🎶»).txt.
The second one has wrong name — Rock & roll é_³æ¥½ («ð__µð__¶Â»).txt

Additional information

TL;DR

If there is "Transfer-Encoding: chunked" header (exactly chunked) setHeader works properly, it sets the input header (ByteString) as is.
(Note: "Transfer-Encoding: chunked" is set by default.)

In any other case it additionally (unnecessary) encodes the header to ByteString.
So, the header is encoded twice, that is wrong.

Additional info

The most of HTTP headers are contains only ASCII characters. But when you need to put in a header (For example, "Content-Disposition", or any custom header) a string that contains non-ASCII* character(s), you can't just put it in as issetHeader.
For example:

// TypeError [ERR_INVALID_CHAR]: Invalid character in header content ["Header-X"]
res.setHeader("Content-Disposition", `attachment; filename="Rock & roll 音楽 («🎵🎶»).txt"`);

A HTTP header is a Binary String (ByteString) — UTF-8 bytes within String object.*

There is no problem with the headers which contain only ASCII characters, since ASCII charset is subset of UTF-8 and Latin 1 encodings, so toByteString(ASCIIString) === ASCIIString.

To get a ByteString from USVString you just need to take UTF-8 bytes from an input string then represent them in Latin 1 (ISO-8859-1) encoding.

For example, in Node.js:

function toByteString(string) {
    return Buffer.from(string).toString("binary"); // or "latin1"
}

*To be honest, the entire quote of [ByteString](https://webidl.spec.whatwg.org/#idl-ByteString:

Such sequences might be interpreted as UTF-8 encoded strings [RFC3629] or strings in some other 8-bit-per-code-unit encoding, although this is not required.

As I can see, a browser also can detect if the string is "just" 8859-1, not UTF-8 bytes encoded in 8859-1.

So, both "Content-Disposition" headers are valid "valid"**:

res.setHeader("Content-Disposition", `attachment; filename="¡«£»÷ÿ.png"`); // Correct ONLY for the some certain browser's languages
res.setHeader("Content-Disposition", `attachment; filename="${toByteString("¡«£»÷ÿ.png")}"`); // Always is correct

The result in both "both"** cases is a file with "¡«£»÷ÿ.png"** name, even while "¡«£»÷ÿ.png" !== toByteString("¡«£»÷ÿ.png").

UPDATE:
**Using non-UTF-8 bytes ("some other 8-bit-per-code-unit encoding") in ByteString is browser/OS language dependent!

For example, in Firefox with non-EN language using of "¡«£»÷ÿ.png" as is (without toByteString()) results to ������.png filename, instead of ¡«£»÷ÿ.png
In Chrome it will be Ў«Ј»чя.png for Cyrillic.

So, I think it (using of 8859-1 in "usual way") should be highly unrecommended.
Headers should always be a ByteString with only UTF-8 bytes represented as 8859-1 (Latin 1).

Problem

The problem is that I can't correctly set a header that is a ByteString (UTF-8 bytes in Latin 1) if the original string contains non-ASCII characters.
Like the other servers do it.

The problem appears only when the Transfer-Encoding: chunked header (which is present by default) is removed (or changed).
In this case setHeader encodes the header to Binary String.

That is unnecessary, since it's already a ByteString.

It's not possible to put in setHeader a USVString, since in this case it will throw TypeError [ERR_INVALID_CHAR]: Invalid character in header content error.

So, the header is encoded to "binary" twice, and browsers download the file with the wrong filenames:
Rock & roll é_³æ¥½ («ð__µð__¶Â»).txt instead of Rock & roll 音楽 («🎵🎶»).txt.

You can open the demo server with disabled Transfer-Encoding: chunked header (http://localhost:8000/?te=0 ) and check it:

function bSrt2Str(bString) {
    return new TextDecoder().decode(binaryStringToArrayBuffer(bString));
}
function binaryStringToArrayBuffer(binaryString) {
    const u8Array = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        u8Array[i] = binaryString.charCodeAt(i);
    }
    return u8Array;
}
let cd = (await fetch("")).headers.get("Content-Disposition");
console.log(cd);
console.log(bSrt2Str(cd));
console.log(bSrt2Str(bSrt2Str(cd)));

image

The header is encoded twice!


Examples

A lot of forums encodes headers such way for the attached files (XenForo, vBulletin, for example).

The real life examples:

Oh, wait, it requires an account, if you don't have/(want to create an account), just use my demo server.
Anyway, just look at the screenshots below.

In the browser console you can verify that header are ByteString:

function bSrt2Str(bString) {
    return new TextDecoder().decode(binaryStringToArrayBuffer(bString));
}
function binaryStringToArrayBuffer(binaryString) {
    const u8Array = new Uint8Array(binaryString.length);
    for (let i = 0; i < binaryString.length; i++) {
        u8Array[i] = binaryString.charCodeAt(i);
    }
    return u8Array;
}
let cd = (await fetch("")).headers.get("Content-Disposition");
console.log(cd);
console.log(bSrt2Str(cd));

image

image


As a bonus, here is an example of Java server made with ServerSocket which also works properly:

Main.java
package com.company;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) {
        try (ServerSocket serverSocket = new ServerSocket(8000)) {
            System.out.println("Sever run on http://127.0.0.1:8000");
            while (true) {
                Socket socket = serverSocket.accept();
                try (BufferedReader input = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter output = new PrintWriter(socket.getOutputStream())) {
                    while (input.ready()) { // Print headers
                        System.out.println(input.readLine());
                    }
                    String name = "Rock & roll 音楽 («\uD83C\uDFB5\uD83C\uDFB6»).txt";
                    System.out.println(name);

                    output.println("HTTP/1.1 200 OK");
                    output.println("Content-Type: text/html; charset=utf-8");
                    output.println("Content-Disposition: attachment; filename=\"" + name + "\"");
                    output.println();
                    output.println(name);
                    output.flush();
                }
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    httpIssues or PRs related to the http subsystem.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions