Skip to content

Possible bug when constructing a string from an array of characters #54275

@d-netto

Description

@d-netto

Consider the MWE provided by @tveldhui:

function foo()
    String(UInt8['a' for i in 1:10000000])
end

function test()
    GC.gc(true)
    baseline = Base.gc_live_bytes()
    while true
        foo()
        GC.gc(true)
        print("gc_live_bytes = +$((Base.gc_live_bytes() - baseline)/1024^2)MiB\n")
    end
end

test()

If we run it, gc_live_bytes seems to increase at a rate of 10MB/iteration:

gc_live_bytes = +514.9888172149658MiB
gc_live_bytes = +524.5256814956665MiB
gc_live_bytes = +534.0625457763672MiB
gc_live_bytes = +543.5994100570679MiB
gc_live_bytes = +553.1362743377686MiB
gc_live_bytes = +562.6731386184692MiB
gc_live_bytes = +572.2100028991699MiB
gc_live_bytes = +581.7468671798706MiB
gc_live_bytes = +591.2837314605713MiB
gc_live_bytes = +600.820595741272MiB

I'm not totally familiar with the Julia array implementation, but this seems to be stemming from a bug in the jl_array_to_string implementation:

JL_DLLEXPORT jl_value_t *jl_array_to_string(jl_array_t *a)
{
    size_t len = jl_array_len(a);
    if (len == 0) {
        // this may seem like purely an optimization (which it also is), but it
        // also ensures that calling `String(a)` doesn't corrupt a previous
        // string also created the same way, where `a = StringVector(_)`.
        return jl_an_empty_string;
    }
    if (a->flags.how == 3 && a->offset == 0 && a->elsize == 1 &&
        (jl_array_ndims(a) != 1 ||
         ((a->maxsize + sizeof(void*) + 1 <= GC_MAX_SZCLASS) == (len + sizeof(void*) + 1 <= GC_MAX_SZCLASS)))) {
        jl_value_t *o = jl_array_data_owner(a);
        if (jl_is_string(o)) {
            a->flags.isshared = 1;
            *(size_t*)o = len;
            a->nrows = 0;
            a->length = 0;
            a->maxsize = 0;
            return o;
        }
    }
    a->nrows = 0;
    a->length = 0;
    a->maxsize = 0;
    return jl_pchar_to_string((const char*)jl_array_data(a), len);
}

On the latest commit of our fork at the time of writing (and also on stock 1.10), it seems like the jl_pchar_to_string called at the very end of jl_array_to_string will allocate a separate buffer for itself and copy the contents of the array into it:

JL_DLLEXPORT jl_value_t *jl_pchar_to_string(const char *str, size_t len)
{
    jl_value_t *s = jl_alloc_string(len);
    if (len > 0)
        memcpy(jl_string_data(s), str, len);
    return s;
}

This raises the question: if a new buffer is allocated to the string, and the contents of the array buffer are copied into it, what's the purpose of these three lines?

    a->nrows = 0;
    a->length = 0;
    a->maxsize = 0;

Seems like we should not have them.

Commenting these three lines seems to solve the issue raised in the MWE above, but again, I'm not fully knowledgeable in this part of the codebase to know whether this proposed change makes sense.

Patch:

diff --git a/src/array.c b/src/array.c
index 5226c729d3..628d40a727 100644
--- a/src/array.c
+++ b/src/array.c
@@ -479,9 +479,9 @@ JL_DLLEXPORT jl_value_t *jl_array_to_string(jl_array_t *a)
             return o;
         }
     }
-    a->nrows = 0;
-    a->length = 0;
-    a->maxsize = 0;
+    // a->nrows = 0;
+    // a->length = 0;
+    // a->maxsize = 0;
     return jl_pchar_to_string((const char*)jl_array_data(a), len);
 }

And new results after applying it:

gc_live_bytes = +0.0049285888671875MiB
gc_live_bytes = +0.005049705505371094MiB
gc_live_bytes = +0.0051708221435546875MiB
gc_live_bytes = +0.005414009094238281MiB
gc_live_bytes = +0.005413055419921875MiB
gc_live_bytes = +0.005534172058105469MiB
gc_live_bytes = +0.0056552886962890625MiB
gc_live_bytes = +0.005898475646972656MiB
gc_live_bytes = +0.00589752197265625MiB
gc_live_bytes = +0.006018638610839844MiB

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions