Description
Context: during the implementation of the Copy-on-Write feature (#48998), there was the idea to make returned arrays read-only for APIs that return underlying arrays (.values, to_numpy(), __array__).
This was initially only done for numpy arrays (the first two PRs), and recently also for columns backed by ExtensionArrays (both when returning an EA (.values / .array) and when returning the EA as a numpy array (to_numpy(), __array__)):
- API / CoW: return read-only numpy arrays in .values/to_numpy() #51082
- CoW: Return read-only array in Index.values #53704
- CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925
The idea behind returning a read-only array is as follows: with Copy-on-Write, the guarantee we provide is that mutating one pandas object (Series, DataFrame) doesn't update another pandas object (whose data is shared as an implementation detail). But users can still easily get a numpy array that is a view on the underlying data, and mutate that array. At that point, we don't have any control over how this mutation propagates (it might update more objects than just the one from which the user obtained it, for example if other Series/DataFrames were sharing data with this object through CoW).
Example to illustrate this:
# creating a dataframe and a derived dataframe through some operation
# (that in this case didn't need to copy)
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df2 = df.sort_values(by="a").reset_index()
# getting a column and mutating this -> CoW gets triggered and only `ser` is changed, not `df`
>>> ser = df["a"]
>>> ser[0] = 100
>>> ser
0 100
1 2
2 3
Name: a, dtype: int64
>>> df
a b
0 1 4
1 2 5
2 3 6
# however, when the code is mutating the numpy array it got from the series (or dataframe)
# (through .values, or np.asarray(ser), etc), then even the derived `df2` is silently mutated
>>> ser = df["a"]
>>> arr = ser.values
>>> arr.flags.writeable = True # <-- this is now needed because we made .values readonly
>>> arr[0] = 100
>>> df2
index a b
0 0 100 4
1 1 2 5
2 2 3 6
Right now, with returning read-only arrays, I have to include arr.flags.writeable = True to make this work (otherwise the above example would raise an error at arr[0] = 100 about the array being read-only).
But if we didn't make the returned arrays read-only, this would work, and such mutations of the underlying numpy array would propagate unpredictably to other pandas series/dataframe objects.
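To isolate the mechanism from pandas itself, here is a minimal numpy-only sketch. It assumes a plain array `shared` as a stand-in for a buffer shared between two pandas objects, and a hypothetical helper `get_values` that mimics returning a read-only view; neither is actual pandas code.

```python
import numpy as np

# Hypothetical stand-in for a buffer that several pandas objects share under CoW.
shared = np.array([1, 2, 3])


def get_values(buf):
    """Hypothetical helper: return a read-only view of `buf` without copying."""
    view = buf.view()
    view.flags.writeable = False
    return view


arr = get_values(shared)

# Writing to the read-only view fails loudly instead of silently
# mutating everything that shares the buffer.
try:
    arr[0] = 100
    write_failed = False
except ValueError:
    write_failed = True

# Explicit opt-in: flipping the flag back is allowed here because the
# base array owns its memory and is itself writeable.
arr.flags.writeable = True
arr[0] = 100

# The write went through the view into the shared buffer.
shares = np.shares_memory(arr, shared)
```

The read-only flag doesn't make the mutation impossible; it turns a silent side effect into an explicit step the user has to take.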