Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stricter recycling rules #98

Open
sebffischer opened this issue Mar 19, 2024 · 5 comments
Open

Stricter recycling rules #98

sebffischer opened this issue Mar 19, 2024 · 5 comments
Labels
meta-proposal Language proposals status-feedback-needed Needs further discussion before moving further theme-internals Relates to internal operations of the language type-design Discussion regarding design of enhancements or project at large

Comments

@sebffischer
Copy link
Collaborator

Currently, the code below works:

> c(1, 2) * c(1, 2, 3)
[1] 1 4 3

I think this should be stricter and at least throw a warning (but I would probably even tend to an error) when the shorter length does not divide the longer length.

@sebffischer
Copy link
Collaborator Author

sebffischer commented Mar 19, 2024

Maybe one wants to be even stricter, i.e. in a operation x + y, either length(x) == length(y) or length(x) == 1 | length(y) == 1. This would be in line with the broadcasting rules implemented in numpy or pytorch.

A strong argument for being stricter is that the recycling rules for vectors are then more consistent with the recycling rules for n-dimensional arrays. For example, it should also not be possible to multiply two matrices with shapes (2, 3) and (2, 4).

@dgkf
Copy link
Owner

dgkf commented Mar 19, 2024

Great catch! I remember thinking about this when implementing the current recycling... and promptly forgetting about it.

This style works great for objects that have a known length... which for now should account for all objects. But my goal is that objects eventually are calculated lazily, and possibly even from generators where the length is unknown.

All operators are currently set up with this laziness in mind. They are implemented on any representation that implements IntoIterator, with the goal that eventually a "vector" will be able to be represented as a generator.

That said, we can add something that is like... try_check_recycle_size_compatibility that checks for the size when both Iterator representations have a size hint, throwing an error if both sizes are known but incompatible, but otherwise silently continues with iterators of unknown size until both iterators are exhausted - which point we will know their size and can raise an appropriate error.

@dgkf dgkf added theme-internals Relates to internal operations of the language type-design Discussion regarding design of enhancements or project at large meta-proposal Language proposals status-feedback-needed Needs further discussion before moving further labels Mar 19, 2024
@sebffischer
Copy link
Collaborator Author

sebffischer commented Mar 24, 2024

One thing that one also has to pay attention to is to make the lazy evaluation work correctly with error handling.

In the code below (assuming stricter recycling rules), the multiplication x * y should throw an error.
However, because y is created from masking 1:3 it should have unknown size (unless we want to calculate the sum of the logical vector, which I guess we don't).
This means, that when calling tryCatch(), the length of y is unknown and one cannot perform compliance checks with the recycling rules when constructing z.
Only later, when mean() will be called on z, would we notice that something went wrong.
However, the stop() call comes before that, so the error handling code would never be executed.

y <- (1:3)[c(TRUE, TRUE, TRUE)]
x <- 1:10
z <- tryCatch(x * y,  # this should throw
  error = function(e) {
     messagef("Catch all them errors")
     NA
  }
)

stop("wupsie")

mean(z)

@sebffischer
Copy link
Collaborator Author

I have started to think about this again and there is another relevant case that is also relevant to error-handling.
This is mostly related to your suggestion from above:

That said, we can add something that is like... try_check_recycle_size_compatibility that checks for the size when both Iterator representations have a size hint, throwing an error if both sizes are known but incompatible, but otherwise silently continues with iterators of unknown size until both iterators are exhausted - which point we will know their size and can raise an appropriate error.

The one possible problem from above regarding the tryCatch() can in principle be solved for forcing the evaluation of the expression x * y, the same problem will also arise in other contexts I believe.

Let's say we do something like

print(1:3[ [true, true, true] ] + 1:2[ [true, true]])

With the recycling rules that I suggested, this should throw an error, as the lengths of the left and right subsets are 3 and 2.
What I believe should happen here eventually is that

  1. 1:3 and 1:2 are represented using RepType::Iterator
  2. Subsetting the Iterator using the Mask, should adapt those iterators akin to a rust .filter(), creating two other RepType::Iterator representations.
  3. Adding the two filtered iterators produces the final iterator
  4. The print() call then iterates over the final iterator and starts printing [1] 1 4 but only then do we notice that an error should be thrown.

The whole problem becomes worse if we - instead of printing the final vector - subset the first element as shown below:

(1:3[ [true, true, true] ] + 1:2[ [true, true]])[1]

instead of printing the first two elements and then throwing an error, this operation will never notice that the addition of the two masked vectors is an invalid operation according to the stricter recycling rules, because it will only yield the first element of the final iterator.

So if we want both the stricter recycling rules and usage of RepType::Iterator, I think we always need to verify the validity of a vector representation (such as RepType::Iterator) when creating the representation, not later when yielding its elements.

@sebffischer
Copy link
Collaborator Author

@dgkf Do you agree that these examples seem to make it necessary to always have a known length?
I would maybe try to work on that next

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta-proposal Language proposals status-feedback-needed Needs further discussion before moving further theme-internals Relates to internal operations of the language type-design Discussion regarding design of enhancements or project at large
Projects
None yet
Development

No branches or pull requests

2 participants