Explicit array conversion (e.g., array(), asarray()) #122

Closed
shoyer opened this issue Jan 30, 2021 · 22 comments · Fixed by #130

@shoyer
Contributor

shoyer commented Jan 30, 2021

Reading through the standard, it appears that we may have missed an important feature: the ability to explicitly coerce objects into a desired array type, either from builtin Python types like float/list or from other array libraries. In other words, we need something like NumPy's array() and/or asarray() functions.

@rgommers
Member

We kind of left that out on purpose, because there's so much variation in how libraries do that. We've got

  • tf.convert_to_tensor
  • torch.tensor
  • numpy/cupy/dask/mxnet/jax.numpy .asarray

Those also have significant variation in what they accept (e.g. do they deal with generators, objects which implement the buffer protocol, etc.).
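
To make that variation concrete, here is a small sketch of how one library (NumPy only; the others may well behave differently) treats a generator compared with a list. The variable names are illustrative.

```python
import numpy as np

# A list is converted element by element into a 1-D numeric array.
from_list = np.asarray([0, 1, 2])

# A generator is not iterated: NumPy wraps the generator object itself
# in a 0-d array of dtype object.
from_gen = np.asarray(i for i in range(3))

# Materializing the generator first gives the expected 1-D array.
from_gen_list = np.asarray(list(i for i in range(3)))

print(from_list.shape, from_gen.shape, from_gen.dtype, from_gen_list.shape)
```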

The idea was:

  • Users can do this with whatever function(s) their library provides. Then they only pass array objects to functions in the array API namespace
  • Libraries should normally not have to construct anything from lists, and should be avoiding asarray.

Perhaps we should reconsider; being able to do array([[1, 3], [5, 99]]) is a reasonable ask.

@shoyer
Contributor Author

shoyer commented Jan 31, 2021

Perhaps we should reconsider; being able to do array([[1, 3], [5, 99]]) is a reasonable ask.

I agree. If we have other array creation functions, I expect this will feel like an obvious gap. Otherwise I expect we would see users writing code like stack([ones(()), -2 * ones(()), ones(())]) rather than the much clearer asarray([1, -2, 1]). Use cases for the latter sort of thing come up, even in library code (e.g., for calculating a Laplacian).
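
For comparison, a minimal sketch of the two spellings using NumPy as a stand-in for an array API namespace (the standard's own functions may differ in detail):

```python
import numpy as np

# Without a conversion function: build each scalar as a 0-d array, then stack.
verbose = np.stack([np.ones(()), -2 * np.ones(()), np.ones(())])

# With a conversion function: state the stencil values directly.
direct = np.asarray([1.0, -2.0, 1.0])

print(np.array_equal(verbose, direct))
```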

For what it's worth, in jax.numpy there are essentially two versions of asarray():

  • The public version of jax.numpy.asarray works like NumPy, converting lists, calling __array__(), etc.
  • The internal version (called inside jax.numpy functions for coercion) allows for explicit NumPy and JAX arrays, but nothing else.
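
The split between the two can be sketched roughly as follows. This is not JAX's actual code; public_asarray and internal_asarray are hypothetical names, and NumPy stands in for the array library.

```python
import numpy as np


def public_asarray(obj):
    """Lenient conversion: accepts lists, scalars, __array__ objects, etc."""
    return np.asarray(obj)


def internal_asarray(obj):
    """Strict coercion for use inside library functions (hypothetical helper)."""
    if not isinstance(obj, np.ndarray):
        raise TypeError(f"expected an array, got {type(obj).__name__}")
    return obj


x = public_asarray([1, -2, 1])      # fine at the user-facing boundary
y = internal_asarray(x)             # fine: already an array

try:
    internal_asarray([1, -2, 1])    # rejected: lists must be converted first
except TypeError as exc:
    error_message = str(exc)
```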

So the existence of asarray() does not mean it needs to be used :)

@rgommers
Member

rgommers commented Feb 1, 2021

Makes sense. We need only one function I think - asarray and array are almost identical. I'd suggest a mash-up, because neither is ideal:

asarray(obj, dtype=None, copy=False, order='K', ndmin=0)

Some thoughts:

  • subok as a keyword doesn't make too much sense to me, subclasses should be fine if the library allows them.
  • order and ndmin make sense for asarray too even though numpy doesn't have them in asarray.
  • what input types obj accepts should be very well specified, and stricter than what numpy does. Probably: sequences, generators, and anything with a __dlpack__ method.
  • Anything that doesn't produce a supported dtype should raise an exception.
  • No buffer protocol, __array__ or __array_interface__.

@shoyer
Contributor Author

shoyer commented Feb 1, 2021

  • order and ndmin make sense for asarray too even though numpy doesn't have them in asarray.

order is implementation specific to NumPy's strided ndarray model. It would not make sense for libraries like TensorFlow/JAX (and likely others, too) that only support contiguous arrays.

On the other hand, a device argument for explicitly assigning where the new array is allocated feels prudent.

ndmin is really weird, particularly because how it manages to ensure a minimum number of dimensions is ambiguous without reading the documentation. I've never used it with numpy, and would suggest leaving it out of the standard.
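
A quick illustration of the ambiguity, using NumPy's existing ndmin keyword; the row/col names are only for illustration:

```python
import numpy as np

x = np.array([1, 2, 3], ndmin=2)

# NumPy prepends the extra axes, so the result is a row, not a column.
print(x.shape)

# Spelling the intent out removes the ambiguity.
row = np.asarray([1, 2, 3])[np.newaxis, :]
col = np.asarray([1, 2, 3])[:, np.newaxis]
```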

  • what input types obj accepts should be very well specified, and stricter than what numpy does. Probably: sequences, generators, and anything with a __dlpack__ method.

We also need builtin Python scalars: int | float | complex. So the type is something like NestedSequence[int | float | complex] | SupportsDLPack.

I would skip generators -- they have unknown size, which means the resulting arrays can't be allocated at once. It's easy enough to require users to cast with list() first.
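
One way that input type could be checked at runtime is sketched below. SupportsDLPack and is_supported_input are hypothetical names used only for illustration, and the check does not verify that nested sequences are rectangular.

```python
from collections.abc import Sequence
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsDLPack(Protocol):
    """Hypothetical protocol: any object exposing a __dlpack__ method."""

    def __dlpack__(self, *args: Any, **kwargs: Any) -> Any: ...


def is_supported_input(obj: Any) -> bool:
    """Return True if obj fits NestedSequence[int | float | complex] | SupportsDLPack."""
    if isinstance(obj, SupportsDLPack):
        return True
    # bool is a subclass of int, so Python booleans pass as well.
    if isinstance(obj, (int, float, complex)):
        return True
    # Strings and byte strings are sequences too, so exclude them explicitly.
    if isinstance(obj, (str, bytes, bytearray)):
        return False
    if isinstance(obj, Sequence):
        return all(is_supported_input(item) for item in obj)
    # Generators and other one-shot iterables land here and are rejected.
    return False


gen = (i for i in range(3))
nested_ok = is_supported_input([[1, 2.5], [3j, 4]])
gen_ok = is_supported_input(gen)
list_ok = is_supported_input(list(gen))   # cast with list() first
print(nested_ok, gen_ok, list_ok)
```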

  • No buffer protocol.

I'm not sure it's worth dropping the buffer protocol. It's used all over the place, including in Python's standard library, and it works just fine (especially for numeric types). Consider a library like Pillow -- do they really gain anything from implementing __dlpack__?

@leofang
Contributor

leofang commented Feb 1, 2021

I like the ideas of

  • A mash-up asarray()
  • Dropping ndmin
  • Accepting obj to be NestedSequence[int | float | complex] | SupportsDLPack and skip generators
  • Keeping the Python buffer protocol alive. On CPUs it works fine and mpi4py wraps everything in Python buffers, including GPU buffers. Though we'd likely support DLPack (by wrapping a DLTensor) anyway.

@leofang
Contributor

leofang commented Feb 1, 2021

@shoyer One thing you said is in conflict: if we'd like to support the buffer protocol, it seems best to keep order, which Python also recognizes (Python and NumPy use the same stride model IIUC).

@rgommers
Member

rgommers commented Feb 1, 2021

order is implementation specific to NumPy's strided ndarray model. It would not make sense for libraries like TensorFlow/JAX (and likely others, too) that only support contiguous arrays.

It doesn't hurt though, and it can help. Even contiguous-only arrays can have C and Fortran order for ndim > 1. I think you mean TF/JAX only support automatically choosing order under the hood?
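
For reference, this is what the two layouts look like for a contiguous 2-D array in NumPy (a sketch; other libraries may not expose layout at all):

```python
import numpy as np

c = np.ones((2, 3), order="C")
f = np.asarray(c, order="F")   # same values, column-major layout

# Both arrays are contiguous, just in different orders.
print(c.flags["C_CONTIGUOUS"], c.flags["F_CONTIGUOUS"])
print(f.flags["C_CONTIGUOUS"], f.flags["F_CONTIGUOUS"])
print(c.strides, f.strides)
```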

device: yes, I forgot to add that.

ndmin is really weird, particularly because how it manages to ensure a minimum number of dimensions is ambiguous without reading the documentation. I've never used it with numpy, and would suggest leaving it out of the standard.

I was thinking that is because numpy has atleast_1/2/3d, and those are quite popular. And we don't have them in the standard.

skip generators

Agree, the unknown size is a good argument to drop them.

I'm not sure it's worth dropping the buffer protocol. It's used all over the place, including in Python's standard library, and it works just fine (especially for numeric types). Consider a library like Pillow -- do they really gain anything from implementing __dlpack__?

Keeping the Python buffer protocol alive. On CPUs it works fine and mpi4py wraps everything in Python buffers, including GPU buffers. Though we'd likely support DLPack (by wrapping a DLTensor) anyway.

The trouble is, if we include it then we are mandating everyone to implement support for it. Which is a pain. Mpi4py and Pillow could easily document that users should convert to a numpy or cupy array as intermediate.

Also considering that in downstream library functions we anyway only want to accept conforming array objects and not mpi4py/Pillow objects, that's a very minor thing to ask. On the other hand, making array libraries implement the buffer protocol just for asarray seems odd. Especially since it would then be the only feature in the whole standard that requires using the Python C API.

@shoyer
Contributor Author

shoyer commented Feb 1, 2021

order is implementation specific to NumPy's strided ndarray model. It would not make sense for libraries like TensorFlow/JAX (and likely others, too) that only support contiguous arrays.

It doesn't hurt though, and it can help. Even contiguous-only arrays can have C and Fortran order for ndim > 1. I think you mean TF/JAX only support automatically choosing order under the hood?

That's right, it's not part of the user facing API.

(More specifically, I was thinking TF/JAX only support C order arrays, but that may actually be an implementation detail that is not necessarily true on all platforms...)

ndmin is really weird, particularly because how it manages to ensure a minimum number of dimensions is ambiguous without reading the documentation. I've never used it with numpy, and would suggest leaving it out of the standard.

I was thinking that is because numpy has atleast_1/2/3d, and those are quite popular. And we don't have them in the standard.

TensorFlow seems to do just fine without either.

Another reason why ndmin is a poor fit is that converting an argument into an array is orthogonal to ensuring it has a certain number of dimensions. If we really care about this functionality, I would rather include it in the form of separate functions like atleast_1/2/3d.

@rgommers
Member

rgommers commented Feb 1, 2021

(More specifically, I was thinking TF/JAX only support C order arrays, but that may actually be an implementation detail that is not necessarily true on all platforms...)

Pretty sure they're passing Fortran-ordered arrays to LAPACK implementations, better for performance.

TensorFlow seems to do just fine without either.

That's good to know. It's easy enough to do some other way, e.g. with reshape. The behaviour is also puzzling:

>>> x = np.arange(3)
>>> np.atleast_2d(x).shape
(1, 3)
>>> np.atleast_3d(x).shape
(1, 3, 1)                                  # why not (1, 1, 3)?
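
A sketch of the reshape alternative, where the caller decides where the new axes go (NumPy used for illustration; the leading/trailing names are made up):

```python
import numpy as np

x = np.arange(3)

# atleast_3d pads a 1-D input on both sides.
padded = np.atleast_3d(x)

# reshape lets the caller pick where the new axes go.
leading = np.reshape(x, (1, 1, 3))
trailing = np.reshape(x, (3, 1, 1))

print(padded.shape, leading.shape, trailing.shape)
```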

@rgommers
Member

rgommers commented Feb 1, 2021

Okay, so I think we're arriving at:

asarray(obj, /, *, dtype=None, copy=False, device=None)
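
To see how that signature might behave, here is a rough sketch layered on NumPy. It is an illustration only, not the wording of the standard: it assumes copy=False means "avoid a copy when possible" and that the only valid device for a NumPy-backed implementation is the CPU.

```python
import numpy as np


def asarray(obj, /, *, dtype=None, copy=False, device=None):
    """Sketch of the proposed signature on top of NumPy (illustration only)."""
    # NumPy arrays live in host memory, so only a CPU "device" makes sense here.
    if device not in (None, "cpu"):
        raise ValueError(f"unsupported device: {device!r}")
    if copy:
        # Always return a fresh array that owns its data.
        return np.array(obj, dtype=dtype, copy=True)
    # Reuse the input when it is already an array of the requested dtype.
    return np.asarray(obj, dtype=dtype)


a = np.arange(3)
same = asarray(a)               # no copy needed
fresh = asarray(a, copy=True)   # explicit copy
as_float = asarray([1, 2], dtype=np.float64)
```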

@rgommers
Member

rgommers commented Feb 1, 2021

Especially since it would then be the only feature in the whole standard that requires using the Python C API.

Actually, I take that back. They're documented as part of the C API, however (from here):
Contrary to most data types exposed by the Python interpreter, buffers are not PyObject pointers but rather simple C structures. This allows them to be created and copied very simply. When a generic wrapper around a buffer is needed, a memoryview object can be created.

@rgommers
Member

rgommers commented Feb 1, 2021

I'm not sure either way now; the docs are very bad. They claim buffers are simple C structures, but then all signatures contain Py* types.

@rgommers
Member

rgommers commented Feb 1, 2021

Got a good answer thanks to Pearu: using memoryview(x) in Python will work if and only if an object supports the buffer protocol. So it does not require the Python C API. That makes it less onerous to support, at least in principle (in practice, all libraries will implement it via Python C API calls I'd expect).
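
That check can be written in pure Python with the standard library only. A minimal sketch, where supports_buffer_protocol is a made-up helper name:

```python
import array


def supports_buffer_protocol(obj):
    """Return True if memoryview() accepts obj, i.e. it exports a buffer."""
    try:
        # The with-block releases the view again right away.
        with memoryview(obj):
            return True
    except TypeError:
        return False


bytes_ok = supports_buffer_protocol(b"abc")
array_ok = supports_buffer_protocol(array.array("d", [1.0, 2.0]))
list_ok = supports_buffer_protocol([1.0, 2.0])
print(bytes_ok, array_ok, list_ok)
```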

@leofang
Contributor

leofang commented Feb 2, 2021

Okay, so I think we're arriving at:


asarray(obj, /, *, dtype=None, copy=False, device=None)

So we drop order?

@rgommers
Member

rgommers commented Feb 2, 2021

So we drop order?

I don't have a strong opinion either way, but find @shoyer's argument mildly convincing - if it's a feature that JAX and TensorFlow do not expose on purpose, then they'd have a keyword that they will just ignore.

On the other hand, NumPy/CuPy/PyTorch/Dask/MXNet all support it just fine, and there's no user-noticeable effect if JAX/TF ignore it.

@rgommers
Member

rgommers commented Feb 2, 2021

Re buffer protocol: if we support it, then we should probably also support __array_interface__ - that's the Python-level equivalent.

@oleksandr-pavlyk
Contributor

__array_interface__ has been largely superseded by the buffer protocol (https://numpy.org/doc/stable/reference/arrays.interface.html), so what is the benefit of having it?

@oleksandr-pavlyk
Contributor

The trouble with the order keyword is whether we mandate which values of order a library must support.

If a library only supports C-contiguous arrays, there can only be one value (the default), and the keyword won't get used.

For libraries with strided support, order can be useful, but what about portability across library implementations?

@rgommers
Member

rgommers commented Feb 2, 2021

__array_interface__ has been largely superseded by the buffer protocol (https://numpy.org/doc/stable/reference/arrays.interface.html), so what is the benefit of having it?

Not superseded: the buffer protocol is C-only and __array_interface__ is Python-only. They do very similar things.
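
For illustration, a pure-Python object can describe its memory to NumPy through __array_interface__ without any C code of its own. ByteBox below is a hypothetical class; in this sketch the "data" entry simply points at a bytearray, which NumPy accepts as an object exposing the buffer interface.

```python
import numpy as np


class ByteBox:
    """Hypothetical pure-Python container that describes its memory to NumPy."""

    def __init__(self, data):
        self._buf = bytearray(data)

    @property
    def __array_interface__(self):
        return {
            "version": 3,
            "shape": (len(self._buf),),
            "typestr": "|u1",      # unsigned 8-bit integers, no byte order
            "data": self._buf,     # an object exposing the buffer interface
        }


box = ByteBox([1, 2, 3])
arr = np.asarray(box)
print(arr.dtype, arr.tolist())
```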

@rgommers
Member

rgommers commented Feb 2, 2021

For libraries with strided support, order can be useful, but what about portability across library implementations?

It is possible for JAX/TF to accept and just ignore order='F', because there is no other user-accessible feature that depends on it. It's really just a performance optimization detail.

@rgommers
Member

Also, if a library only has C ordered arrays, then this all still makes sense:

    order   no copy     copy=True
    'K'     unchanged   F & C order preserved, otherwise most similar order
    'A'     unchanged   F order if input is F and not C, otherwise C order
    'C'     C order     C order
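
The copy=True column can be checked against NumPy directly. A small sketch with a Fortran-ordered input (variable names are illustrative):

```python
import numpy as np

f_input = np.asfortranarray(np.ones((2, 3)))

kept_k = np.array(f_input, order="K", copy=True)    # F order preserved
kept_a = np.array(f_input, order="A", copy=True)    # F input (not C) stays F
forced_c = np.array(f_input, order="C", copy=True)  # always C order

print(kept_k.flags["F_CONTIGUOUS"],
      kept_a.flags["F_CONTIGUOUS"],
      forced_c.flags["C_CONTIGUOUS"])
```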

@rgommers rgommers self-assigned this Feb 17, 2021
rgommers added a commit to rgommers/array-api that referenced this issue Feb 17, 2021
@rgommers
Member

Opened gh-130 to add asarray
