Explicit array conversion (e.g., array(), asarray()) #122

shoyer · 2021-01-30T23:07:17Z

Reading through the standard, it appears that we may have missed an important feature: the ability to explicitly coerce objects into a desired array type, either from builtin Python types like float/list or from other array libraries. In other words, we need something like NumPy's array() and/or asarray() functions.

rgommers · 2021-01-31T19:58:46Z

We kind of left that out on purpose, because there's so much variation in how libraries do that. We've got

tf.convert_to_tensor
torch.tensor
numpy/cupy/dask/mxnet/jax.numpy .asarray

Those also have significant variation in what they accept (e.g. do they deal with generators, objects which implement the buffer protocol, etc.).

The idea was:

Users can do this with whatever function(s) their library provides. Then they only pass array objects to functions in the array API namespace
Libraries should normally not have to construct anything from lists, and should be avoiding asarray.

Perhaps we should reconsider; being able to do array([[1, 3], [5, 99]]) is a reasonable ask.

shoyer · 2021-01-31T22:18:32Z

Perhaps we should reconsider; being able to do array([[1, 3], [5, 99]]) is a reasonable ask.

I agree. If we have other array creation functions, I expect this will feel like an obvious gap. Otherwise I expect we would see users writing code like stack([ones(()), -2 * ones(()), ones(())]) rather than the much clearer asarray([1, -2, 1]). Use cases for the latter sort of thing come up, even in library code (e.g., for calculating a Laplacian).
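To make the comparison concrete, here is a minimal sketch that assumes NumPy as the array library. The names `kernel_verbose`, `kernel`, and `laplacian` are illustrative only, and the Laplacian is the simplest 1-D second difference with unit grid spacing.

```python
import numpy as np

# Workaround that uses only creation functions such as ones() and stack().
kernel_verbose = np.stack([np.ones(()), -2 * np.ones(()), np.ones(())])

# Direct conversion from a Python list.
kernel = np.asarray([1.0, -2.0, 1.0])

# 1-D discrete Laplacian of f(x) = x**2 sampled on an integer grid.
x = np.arange(5.0)
laplacian = np.convolve(x**2, kernel, mode="valid")
print(laplacian)  # [2. 2. 2.]
```

Both spellings build the same three-element stencil; the second one states the intent directly.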

For what it's worth, in jax.numpy there are essentially two versions of asarray():

The public version of jax.numpy.asarray works like NumPy, converting lists, calling __array__(), etc.
The internal version (called inside jax.numpy functions for coercion) allows for explicit NumPy and JAX arrays, but nothing else.

So the existence of asarray() does not mean it needs to be used :)
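A rough sketch of the "strict internal coercion" idea, written against NumPy rather than JAX. The helper `_ensure_array` and the function `double` are hypothetical names, not taken from jax.numpy.

```python
import numpy as np


def _ensure_array(x):
    """Hypothetical internal helper: accept arrays only, never lists or scalars."""
    if isinstance(x, np.ndarray):
        return x
    raise TypeError(f"expected an array, got {type(x).__name__}")


def double(x):
    # Library function: coerces strictly, so a stray list fails loudly
    # instead of being converted silently.
    return 2 * _ensure_array(x)


print(double(np.asarray([1, 2])))  # [2 4]
```

Users still call the public asarray() at the boundary of their code; library internals only ever see array objects.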

rgommers · 2021-02-01T16:22:16Z

Makes sense. We need only one function I think - asarray and array are almost identical. I'd suggest a mash-up, because neither is ideal:

asarray(obj, dtype=None, copy=False, order='K', ndmin=0)

Some thoughts:

subok as a keyword doesn't make too much sense to me, subclasses should be fine if the library allows them.
order and ndmin make sense for asarray too even though numpy doesn't have them in asarray.
What input types obj accepts should be very well specified, and stricter than what numpy does. Probably: sequences, generators, and anything with a __dlpack__ method.
Anything that doesn't produce a supported dtype should raise an exception.
No buffer protocol, __array__ or __array_interface__.

shoyer · 2021-02-01T18:41:18Z

order and ndmin make sense for asarray too even though numpy doesn't have them in asarray.

order is implementation specific to NumPy's strided ndarray model. It would not make sense for libraries like TensorFlow/JAX (and likely others, too) that only support contiguous arrays.

On the other hand, a device argument for explicitly assigning where the new array is allocated feels prudent.

ndmin is really weird, particularly because how it manages to ensure a minimum number of dimensions is ambiguous without reading the documentation. I've never used it with numpy, and would suggest leaving it out of the standard.
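For reference, a small NumPy sketch of the ambiguity: ndmin prepends axes, which is not obvious from the call site, whereas explicit indexing states where the new axis goes.

```python
import numpy as np

# ndmin=2 prepends a length-1 axis.
x = np.array([1, 2, 3], ndmin=2)
print(x.shape)  # (1, 3)

# Explicit alternatives make the placement of the new axis visible.
values = np.asarray([1, 2, 3])
row = values[np.newaxis, :]
col = values[:, np.newaxis]
print(row.shape, col.shape)  # (1, 3) (3, 1)
```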

What input types obj accepts should be very well specified, and stricter than what numpy does. Probably: sequences, generators, and anything with a __dlpack__ method.

I would skip generators -- they have unknown size, which means the resulting arrays can't be allocated at once. It's easy enough to require users to cast with list() first.
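As an illustration of why the list() cast is cheap to ask for, here is what current NumPy does with a generator; the generator function `squares` is a made-up example. NumPy does not iterate the generator and instead wraps it in a 0-d object array.

```python
import numpy as np


def squares(n):
    # Illustrative generator; its length is unknown until it is consumed.
    for i in range(n):
        yield i * i


wrapped = np.asarray(squares(4))          # generator object stored as-is
converted = np.asarray(list(squares(4)))  # size known, elements converted

print(wrapped.shape, wrapped.dtype)  # () object
print(converted)                     # [0 1 4 9]
```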

No buffer protocol.

I'm not sure it's worth dropping the buffer protocol. It's used all over the place, including in Python's standard library, and it works just fine (especially for numeric types). Consider a library like Pillow -- do they really gain anything from implementing __dlpack__?

leofang · 2021-02-01T18:51:20Z

I like the ideas of

A mash-up asarray()
Dropping ndmin
Accepting obj to be NestedSequence[int | float | complex] | SupportsDLPack and skip generators
Keeping the Python buffer protocol alive. On CPUs it works fine and mpi4py wraps everything in Python buffers, including GPU buffers. Though we'd likely support DLPack (by wrapping a DLTensor) anyway.
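A minimal sketch of what "nested sequences or DLPack, but no generators" could look like as a runtime check. The function `is_supported_input` and the class `FakeDLPack` are hypothetical; accepting bare scalars at the top level and rejecting str/bytes are assumptions made here, not something decided in this thread.

```python
from collections.abc import Sequence
from typing import Any


def is_supported_input(obj: Any) -> bool:
    """Hypothetical validator for the input types discussed above."""
    if hasattr(obj, "__dlpack__"):
        return True  # anything exposing DLPack
    if isinstance(obj, (bool, int, float, complex)):
        return True  # scalar leaf
    if isinstance(obj, (str, bytes)):
        return False  # sequences, but not numeric ones
    if isinstance(obj, Sequence):
        return all(is_supported_input(item) for item in obj)
    return False  # generators and everything else


class FakeDLPack:
    # Stand-in object; only the presence of the method matters here.
    def __dlpack__(self, stream=None):
        raise NotImplementedError


print(is_supported_input([[1, 3], [5, 99]]))            # True
print(is_supported_input(i for i in range(3)))          # False
print(is_supported_input(FakeDLPack()))                 # True
```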

leofang · 2021-02-01T18:55:41Z

@shoyer One thing you said is in conflict: if we'd like to support the buffer protocol, it seems best to keep order, which Python also recognizes (Python and NumPy use the same stride model IIUC).

rgommers · 2021-02-01T19:06:39Z

order is implementation specific to NumPy's strided ndarray model. It would not make sense for libraries like TensorFlow/JAX (and likely others, too) that only support contiguous arrays.

It doesn't hurt though, and it can help. Even contiguous-only arrays can have C and Fortran order for ndim > 1. I think you mean TF/JAX only support automatically choosing order under the hood?
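A short NumPy sketch of the point that contiguous arrays can still differ in order when ndim > 1. The dtype is fixed to int64 only so that the printed strides are the same on every platform.

```python
import numpy as np

c = np.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.int64, order="C")
f = np.asarray(c, order="F")  # same values, column-major memory layout

print(c.flags.c_contiguous, f.flags.f_contiguous)  # True True
print(c.strides, f.strides)                        # (24, 8) (8, 16)
print(np.array_equal(c, f))                        # True
```

Both arrays are contiguous and compare equal; only the memory layout, and therefore the strides, differ.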

device: yes, I forgot to add that.

ndmin is really weird, particularly because how it manages to ensure a minimum number of dimensions is ambiguous without reading the documentation. I've never used it with numpy, and would suggest leaving it out of the standard.

I was thinking that is because numpy has atleast_1/2/3d, and those are quite popular. And we don't have them in the standard.

skip generators

Agree, the unknown size is a good argument to drop them.

I'm not sure it's worth dropping the buffer protocol. It's used all over the place, including in Python's standard library, and it works just fine (especially for numeric types). Consider a library like Pillow -- do they really gain anything from implementing __dlpack__?

Keeping the Python buffer protocol alive. On CPUs it works fine and mpi4py wraps everything in Python buffers, including GPU buffers. Though we'd likely support DLPack (by wrapping a DLTensor) anyway.

The trouble is, if we include it then we are mandating everyone to implement support for it. Which is a pain. Mpi4py and Pillow could easily document that users should convert to a numpy or cupy array as intermediate.

Also considering that in downstream library functions we anyway only want to accept conforming array objects and not mpi4py/Pillow objects, that's a very minor thing to ask. On the other hand, making array libraries implement the buffer protocol just for asarray seems odd. Especially since it would then be the only feature in the whole standard that requires using the Python C API.

shoyer · 2021-02-01T19:15:39Z

order is implementation specific to NumPy's strided ndarray model. It would not make sense for libraries like TensorFlow/JAX (and likely others, too) that only support contiguous arrays.

It doesn't hurt though, and it can help. Even contiguous-only arrays can have C and Fortran order for ndim > 1. I think you mean TF/JAX only support automatically choosing order under the hood?

That's right, it's not part of the user facing API.

(More specifically, I was thinking TF/JAX only support C order arrays, but that may actually be an implementation detail that is not necessarily true on all platforms...)

ndmin is really weird, particularly because how it manages to ensure a minimum number of dimensions is ambiguous without reading the documentation. I've never used it with numpy, and would suggest leaving it out of the standard.

I was thinking that is because numpy has atleast_1/2/3d, and those are quite popular. And we don't have them in the standard.

TensorFlow seems to do just fine without either.

Another reason why ndmin is a poor fit is that converting an argument into an array is orthogonal to ensuring it has a certain number of dimensions. If we really care about this functionality, I would rather include it in the form of separate functions like atleast_1/2/3d.

rgommers · 2021-02-01T19:24:26Z

(More specifically, I was thinking TF/JAX only support C order arrays, but that may actually be an implementation detail that is not necessarily true on all platforms...)

Pretty sure they're passing Fortran-ordered arrays to LAPACK implementations, better for performance.

TensorFlow seems to do just fine without either.

That's good to know. It's easy enough to do some other way, e.g. with reshape. The behaviour is also puzzling:

>>> x = np.arange(3)
>>> np.atleast_2d(x).shape
(1, 3)
>>> np.atleast_3d(x).shape
(1, 3, 1)  # why not (1, 1, 3)?

rgommers · 2021-02-01T19:26:41Z

Okay, so I think we're arriving at:

asarray(obj, /, *, dtype=None, copy=False, device=None)
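A sketch of how this signature might be layered on top of NumPy, under several assumptions that are not settled in this thread: copy=False is read as "copy only if needed" (NumPy's asarray behaviour), the only accepted device is the string "cpu", and generators are rejected as discussed above.

```python
from types import GeneratorType

import numpy as np


def asarray(obj, /, *, dtype=None, copy=False, device=None):
    """Illustrative CPU-only wrapper; not the specified behaviour."""
    if device not in (None, "cpu"):  # "cpu" is an assumed device name
        raise ValueError(f"unsupported device: {device!r}")
    if isinstance(obj, GeneratorType):
        raise TypeError("generators are not accepted; wrap in list() first")
    if copy:
        return np.array(obj, dtype=dtype, copy=True)
    return np.asarray(obj, dtype=dtype)  # copies only when needed


a = np.arange(3)
print(asarray(a) is a)                     # True
print(asarray(a, copy=True) is a)          # False
print(asarray([[1, 3], [5, 99]]).shape)    # (2, 2)
```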

rgommers · 2021-02-01T19:36:18Z

Especially since it would then be the only feature in the whole standard that requires using the Python C API.

Actually, I take that back. They're documented as part of the C API, however (from here):
Contrary to most data types exposed by the Python interpreter, buffers are not PyObject pointers but rather simple C structures. This allows them to be created and copied very simply. When a generic wrapper around a buffer is needed, a memoryview object can be created.

rgommers · 2021-02-01T19:39:06Z

I'm not sure either way now; the docs are very bad. They claim buffers are simple C structures, but then all signatures contain Py* types.

rgommers · 2021-02-01T21:00:57Z

Got a good answer thanks to Pearu: using memoryview(x) in Python will work if and only if an object supports the buffer protocol. So it does not require the Python C API. That makes it less onerous to support, at least in principle (in practice, all libraries will implement it via Python C API calls I'd expect).
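The memoryview check can be sketched in pure Python as follows; the helper name `supports_buffer_protocol` is made up for illustration.

```python
import array


def supports_buffer_protocol(obj) -> bool:
    """Return True if memoryview(obj) succeeds, i.e. obj exports a buffer."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True


print(supports_buffer_protocol(b"abc"))                    # True
print(supports_buffer_protocol(array.array("d", [1.0])))   # True
print(supports_buffer_protocol([1, 2, 3]))                 # False
```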

leofang · 2021-02-02T02:26:50Z

Okay, so I think we're arriving at:
asarray(obj, /, *, dtype=None, copy=False, device=None)

So we drop order?

rgommers · 2021-02-02T08:27:49Z

So we drop order?

I don't have a strong opinion either way, but find @shoyer's argument mildly convincing - if it's a feature that JAX and TensorFlow do not expose on purpose, then they'd have a keyword that they will just ignore.

On the other hand, NumPy/CuPy/PyTorch/Dask/MXNet all support it just fine, and there's no user-noticeable effect if JAX/TF ignore it.

rgommers · 2021-02-02T08:28:27Z

Re buffer protocol: if we support it, then we should probably also support __array_interface__ - that's the Python-level equivalent.

oleksandr-pavlyk · 2021-02-02T15:31:12Z

__array_interface__ has been largely superseded by buffer protocol (https://numpy.org/doc/stable/reference/arrays.interface.html), so what is the benefit of having it?

oleksandr-pavlyk · 2021-02-02T15:34:45Z

The trouble with order keyword is whether we mandate what values of order a library must support.

If a library only supports C-contiguous arrays, there can only be one value (default), and the keyword won't get used.

For libraries with strided support order can be useful, but what about portability across library implementations?

rgommers · 2021-02-02T15:48:10Z

__array_interface__ has been largely superseded by buffer protocol (https://numpy.org/doc/stable/reference/arrays.interface.html), so what is the benefit of having it?

Not superseded; the buffer protocol is C-only and __array_interface__ is Python-only. They do very similar things.
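For readers unfamiliar with the Python-level protocol, a minimal sketch: the class `Wrapper` is hypothetical and simply forwards the `__array_interface__` dict of a NumPy array it owns, which is enough for NumPy to consume it without any C code in the wrapper itself.

```python
import numpy as np


class Wrapper:
    """Hypothetical container that exposes its data via __array_interface__."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def __array_interface__(self):
        # A plain dict describing shape, type string, and data pointer.
        return self._data.__array_interface__


w = Wrapper([1, 2, 3])
view = np.asarray(w)

print(sorted(k for k in ("shape", "typestr", "data") if k in w.__array_interface__))
print(view.tolist())  # [1, 2, 3]
```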

rgommers · 2021-02-02T15:50:30Z

For libraries with strided support order can be useful, but what about portability across library implementations?

It is possible for JAX/TF to accept and just ignore order='F', because there is no other user-accessible feature that depends on it. It's really just a performance optimization detail.

rgommers · 2021-02-17T17:12:39Z

Also, if a library only has C ordered arrays, then this all still makes sense:

    order   no copy     copy=True
    'K'     unchanged   F & C order preserved, otherwise most similar order
    'A'     unchanged   F order if input is F and not C, otherwise C order
    'C'     C order     C order
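A quick NumPy check of the copying case for these rows, starting from a Fortran-ordered input. The variable names are illustrative.

```python
import numpy as np

f_in = np.asfortranarray(np.arange(6).reshape(2, 3))

k = np.array(f_in, order="K", copy=True)  # keeps the input's F order
a = np.array(f_in, order="A", copy=True)  # F, since input is F and not C
c = np.array(f_in, order="C", copy=True)  # always C order

print(k.flags.f_contiguous, a.flags.f_contiguous, c.flags.c_contiguous)
# True True True
```

A library that only has C-ordered arrays never sees an F-ordered input, so all three rows collapse to "C order" for it.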

rgommers · 2021-02-17T19:59:26Z

Opened gh-130 to add asarray

shoyer mentioned this issue Jan 30, 2021

Questions related to adoption #120

Closed

rgommers self-assigned this Feb 17, 2021

rgommers added a commit to rgommers/array-api that referenced this issue Feb 17, 2021

Add specification for asarray

2257725

Closes data-apisgh-122

rgommers mentioned this issue Feb 17, 2021

Add specification for asarray #130

Merged

rgommers closed this as completed in #130 Feb 21, 2021

rgommers added a commit that referenced this issue Feb 21, 2021

Add specification for asarray (#130)

e1870ff

Closes gh-122

thomasjpfan mentioned this issue Mar 12, 2022

ENH Adds Array API support to LinearDiscriminantAnalysis scikit-learn/scikit-learn#22554

Merged

honno mentioned this issue Oct 28, 2022

Optionally generate noncontiguous arrays HypothesisWorks/hypothesis#3489

Closed
