Update concatenated seqs #2947
0e1ee92
to
98f0f44
Compare
Codecov Report
@@           Coverage Diff           @@
##           master    #2947   +/-   ##
=======================================
  Coverage   98.28%   98.28%
=======================================
  Files         267      267
  Lines       11466    11474    +8
=======================================
+ Hits        11269    11277    +8
  Misses        197      197
Honestly, I have some difficulty understanding your proposed changes with regard to the semantics of this data structure. I have to admit that I am not familiar with how it was used before, so I might be missing some valuable information. So far, I see it as a dynamic container data structure specifically used to store sequences in one consecutive chunk of memory; a Container-of-Containers. It presumably improves memory efficiency by avoiding fragmented allocation, as opposed to a regular solution using However, I think that Assuming that what I remember is true, then the only advantage of our container would be if it can compete with this regular STL solution performance-wise and improves the usability, because people might not be used to the pmr stuff, e.g. a memory resource can neither be copied nor moved and needs to live as long as the allocated memory is used by some part of the program. But these changes don't look more usable to me. A In addition, I can't take the value type anymore and store it somewhere else without taking care of keeping the original concatenated container alive, as it would result in a dangling reference otherwise. And understanding this as a container, I would be very surprised by this kind of behavior. And even more so, it is then just like an alias of a pmr resource, is it not? So using a proxy as reference type nicely encapsulates the intended behavior of having similar behavior to a So what I don't get is, why I can't wrap the returned proxy into a |
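As a rough illustration of the idea described in this comment (all sequences in one consecutive chunk of memory, elements handed out as non-owning views), here is a minimal sketch. `toy_concatenated_strings` and its members are hypothetical names chosen for this example; it is not the SeqAn3 implementation and only assumes a flat buffer plus a vector of delimiter positions.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy type for illustration only; NOT seqan3::concatenated_sequences.
// All characters live in one buffer; delimiters[i] and delimiters[i + 1]
// mark the begin and end of the i-th sequence.
class toy_concatenated_strings
{
public:
    // The element "reference" is a non-owning view into the shared buffer.
    using reference = std::string_view;

    // Appends a whole sequence at the end of the flat buffer.
    void push_back(std::string_view seq)
    {
        data.append(seq);
        delimiters.push_back(data.size());
    }

    std::size_t size() const noexcept { return delimiters.size() - 1; }

    // Returns a view; it is only valid while *this is alive and unmodified.
    reference operator[](std::size_t i) const
    {
        return std::string_view(data).substr(delimiters[i], delimiters[i + 1] - delimiters[i]);
    }

    // Read access to the flat buffer.
    std::string const & concat() const noexcept { return data; }

private:
    std::string data;                       // every sequence, back to back
    std::vector<std::size_t> delimiters{0}; // one more entry than sequences
};
```

Because `operator[]` returns a view, storing its result beyond the lifetime of the container leaves a dangling view, which is the behaviour the comment is concerned about.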
OK, so I think we have three questions here:
The first two questions were raised by my PR, the third one was raised by you.
Ad 1:
std::ranges::range_value_t<rng_t> tmp;
for (std::ranges::range_value_t<rng_t> elem : range)
{
if (foo)
{
frobnicate(elem);
tmp = elem;
}
else
{
frobnicate(tmp);
}
}
Since our reference type does not include I don't know if anyone else uses this data structure at the moment, but I expect impact to be minimal as explicit use of the value_type is rare and even use-as-shown-above would still be valid (even faster). The interfaces are still experimental anyway.
Ad 2:
auto magic_input = /**/;
concatenated_sequences<std::string> output;
for (/**/)
{
std::string buffer;
// std::ranges::copy writes through an output iterator, hence std::back_inserter
std::ranges::copy(first_subrange_of_magic_input, std::back_inserter(buffer));
buffer += '|';
std::ranges::copy(second_subrange_of_magic_input, std::back_inserter(buffer));
buffer += '|';
std::ranges::copy(third_subrange_of_magic_input, std::back_inserter(buffer));
output.push_back(buffer);
}
Now the problem is that all the data is needlessly copied to the buffer and then into the concatenated_sequences. Moving does not help here, because concatenated_sequences cannot re-use the memory. This is an important limitation compared with a
auto magic_input = /**/;
concatenated_sequences<std::string> output;
for (/**/)
{
output.push_back();
output.append_back(first_subrange_of_magic_input);
output.push_back_back('|');
output.append_back(second_subrange_of_magic_input);
output.push_back_back('|');
output.append_back(third_subrange_of_magic_input);
}
This avoids the unnecessary copy operations. These changes are non-breaking because they only add members. If people don't like the names, we can change them.
Ad 3:
I need both of these things. I am also not sure how it handles dynamic growth, but that might be solvable. So while pmr and pool-allocators certainly have use-cases, I don't think they fit for me here. |
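To make the argument under "Ad 2" concrete, the following is a minimal sketch of how members with the names used above (`push_back()`, `append_back()`, `push_back_back()`) could write straight into a shared buffer. The class `toy_concat`, the helper `add_record` and the member bodies are assumptions made for this example; they are not taken from the PR diff.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy type; member names mirror the comment above,
// the bodies are an assumption about how such members could work.
class toy_concat
{
public:
    // Opens a new, empty sequence at the end.
    void push_back() { delimiters.push_back(data.size()); }

    // Appends characters to the last sequence, directly in the shared buffer.
    // Precondition: push_back() has been called at least once.
    void append_back(std::string_view chars)
    {
        data.append(chars);
        delimiters.back() = data.size();
    }

    // Appends a single character to the last sequence (same precondition).
    void push_back_back(char c)
    {
        data.push_back(c);
        delimiters.back() = data.size();
    }

    std::size_t size() const noexcept { return delimiters.size() - 1; }

    std::string_view operator[](std::size_t i) const
    {
        return std::string_view(data).substr(delimiters[i], delimiters[i + 1] - delimiters[i]);
    }

private:
    std::string data;                       // every sequence, back to back
    std::vector<std::size_t> delimiters{0}; // one more entry than sequences
};

// Builds one "a|b|c" record without going through a temporary std::string.
inline void add_record(toy_concat & out, std::string_view a, std::string_view b, std::string_view c)
{
    out.push_back();
    out.append_back(a);
    out.push_back_back('|');
    out.append_back(b);
    out.push_back_back('|');
    out.append_back(c);
}
```

In this sketch each character is copied once, from the input into the shared buffer, whereas the buffer-based variant copies it twice.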
I do not yet have an opinion about whether these changes should be implemented. I think I would need some more experience in designing container APIs. But I am interested in following the discussion.
I only added some thoughts about the changes in case we want to merge them.
@@ -200,26 +97,26 @@ class concatenated_sequences
* \{
*/

/*!\brief == inner_type.
/*!\brief An views::slice that represents the range on the concatenated vector.
A views::slice
* \experimentalapi{Experimental since version 3.1.}
*/
template <std::ranges::forward_range rng_type>
void append_back(rng_type && value)
Since there is an inner push_back_back(...) that corresponds to the outer push_back(...), maybe it would be nice to have an outer version of this function, e.g. append(...). This could prevent confusion, because append_back(...) sounds similar to push_back(...), but is actually an inner function like push_back_back().
This technically exists already with some more flexibility in the form of insert(...). So maybe it would be nice to have a directly corresponding set of insert_back() overloads. However, this would likely not be ergonomic, due to the iterator-based API of insert(...) and the fact that we return a view as outer elements.
So I guess we stick with this ^^
This is exactly my problem at the moment. The proposed changes do not fit well into the current design of this data structure. Irrespective of the naming. Nor do they fit into any known interface of similar structures from the STL. I am not saying that this is necessarily a bad thing. But if the proposed changes are important in order to use it efficiently then the semantic meaning of this structure should be reconsidered. Or maybe a different structure is needed.
General design concerns
I think there is a little misunderstanding. I didn't propose any data structure. I wanted to make sure I understand all of the implications of the suggested changes in order to add constructive ideas to the whole thing. For this, I needed to know what the closest semantic equivalent among compliant STL solutions is, which you agree is a
I am asking, because alternatively the concat sequence could also be seen as a special These are two completely valid scenarios and probably can be implemented with the same internal data structures but they are semantically very different and therefore their interfaces must not be mixed!
Inline answers
That's not correct.
The returned type has reference semantics. A view is nothing but a reference to some memory owned by some other object.
Again, making the value_type = reference opens the door for (unintentional) dangling references which brings us directly into the land of UB.
for (std::ranges::range_reference_t<rng_t> elem : range) { ... }
for (auto && elem : range) { ... }
for (std::ranges::forward_range auto && elem : range) { ... }
Either people know C++ and know the difference between reference semantics and value semantics or they don't, because they might be very new to C++ and have to learn this over time. That is how it is. No one can change that, and it is not a criticism. It is just part of the language.
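The distinction between a proxy reference type and an independent value type can be seen in the standard library itself. The sketch below uses `std::vector<bool>` as a stand-in, since it is a well-known container whose `reference` is a proxy class while its `value_type` is a plain `bool`. The function name `copy_outlives_range` is made up for this example, and the member typedefs are used instead of `std::ranges::range_value_t` only to keep the snippet valid in C++17.

```cpp
#include <type_traits>
#include <vector>

// std::vector<bool> hands out proxy objects as references,
// but its value type is an ordinary, self-contained bool.
using rng_t = std::vector<bool>;

static_assert(std::is_same_v<rng_t::value_type, bool>);
static_assert(!std::is_same_v<rng_t::reference, bool &>);
static_assert(std::is_class_v<rng_t::reference>);

// Copies the last element into a value and then destroys the range.
inline bool copy_outlives_range()
{
    rng_t::value_type tmp{};
    {
        rng_t range{false, true};
        for (auto && elem : range) // elem is a proxy tied to `range`
            tmp = elem;            // converting to value_type makes a real copy
    }                              // `range` is gone, `tmp` remains valid
    return tmp;
}
```

If the value type were the same as the proxy type, `tmp` would still refer to the destroyed range after the inner scope ends, which is the dangling case the comment warns about.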
Proxies are never easy. But that does not mean it is ok to weaken other guarantees.
auto magic_input = /**/;
concatenated_sequences<std::string> output;
for (/**/)
{
output.resize(output.size() + 1);
auto & last_buffer = output.back();
last_buffer.reserve(1024); // possibly reserve some additional memory
std::ranges::copy(first_subrange_of_magic_input, std::back_inserter(last_buffer));
last_buffer.push_back('|');
std::ranges::copy(second_subrange_of_magic_input, std::back_inserter(last_buffer));
last_buffer.push_back('|');
std::ranges::copy(third_subrange_of_magic_input, std::back_inserter(last_buffer));
/// ...
}
Doing so for the last element will be as efficient as it gets for expanding memory to the end of a container.
Well, there is a misunderstanding. The memory resource itself should not be owned by the concat sequence. I think that is not how the pmr allocator stuff should be used. But the concat sequence, aka. vector<vector<>>, is initialised with the address of the memory resource. And this one has to live as long as the objects using its memory live.
My conclusion
So I am in favour of the concat sequence. Don't get me wrong on this. Otherwise, we are again starting to be not STL-like. As such the interfaces become harder to use, as well as harder to maintain, as everyone involved has to understand why it is not possible to use a container as one is supposed to do. |
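The pmr-based alternative mentioned in this comment, a vector of vectors initialised with the address of a memory resource, can be sketched with the standard `<memory_resource>` facilities. The helper name `make_sequences` and the example strings are assumptions made for this illustration; the point is only that the container borrows the resource and does not own it.

```cpp
#include <memory_resource>
#include <string>
#include <vector>

// Builds a vector-of-strings whose outer and inner allocations both go
// through `pool`. The pool is only borrowed: it must outlive the result.
inline std::pmr::vector<std::pmr::string> make_sequences(std::pmr::memory_resource & pool)
{
    // The container is initialised with the address of the memory resource.
    std::pmr::vector<std::pmr::string> seqs(&pool);

    // emplace_back passes the allocator on to each inner string.
    // The strings are long enough to need heap memory on common implementations.
    seqs.emplace_back("ACGTACGTACGTACGTACGTACGTACGT");
    seqs.emplace_back("GGGGCCCCGGGGCCCCGGGGCCCCGGGG");
    return seqs;
}
```

A caller would typically declare something like a `std::pmr::monotonic_buffer_resource` first and the container second, so that the resource is destroyed last.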
Thank you for taking the time to illustrate your view on this. As much as I would like to discuss all these points in detail, I simply do not have the time or energy to do so. I am already spending more time arguing for this than I spent writing it 😿 I will briefly summarise my arguments and will leave all further changes, discussions and decisions up to the team.
Most users are not aware of either of these. They call To be able to do this change, we need to change the value_type, as well. It looks like the value_type is an important part of the design, but actually it is not. In fact, for the given data structure, none of my code ever uses it. You only use the value_type if you explicitly decide to use the value_type, and you know what you are doing––in which case you can also copy the elements if you need to.
You have repeatedly said that library designers should not create designs out of thin air, but based on use-cases and feedback from downstream developers. I, being such a downstream developer and user of the given data structure, have identified a problem in the data structure that prevents me from using it efficiently. I have also created a solution to the problem by adding several member functions, each with a maximum of two-lines implementation––that's very small compared to all the append() and insert() functions this class has.
SeqAn3 has had this data structure for five years already, and previous SeqAn versions and applications made heavy use of the predecessor. So I am not quite sure why the "why" needs to be discussed right now. If you do decide that you want to re-discuss the entire data structure, here are my current requirements:
That having been said, since SeqAn3 has this class now that is fully written and tested and can be interchanged with |
I feel you brother 😎
Just to clarify, there is not a single sentence, where I am arguing against this data structure.
So, please don't cite me wrong! Thank you 🙏 |
Core Meeting 09.03.2022
|
I will do the renaming as suggested and update the PR.
Changing this means that the underlying container will always be a vector and cannot be changed anymore. That means no What I will include in this PR (if that's OK), is renaming the template parameter from |
This is a good point. We only considered the user interface. Without the proxy, you cannot return the container in any way, i.e.,
it still matters as to how the sequence is stored internally. This probably also shows that it's not clear what the template parameter actually influences — from the documentation, it's not immediately clear to me that this is the actual type that is internally used to store the sequences. We will need to decide whether we want:
I don't see why not 👍 |
98f0f44
to
66e58f7
Compare
I updated the documentation to reflect this. What do you think?
Right now, the two template parameters correspond to the two member variables. I think that this is as simple as it gets, and being transparent about this helps users understand how the data structure works and what the implications are. edit: But that's just my opinion, and I don't have strong feelings about it, as long as the functionality that I need is present :) |
@seqan/core I wasn't at the core meeting so I won't refute your decisions. I would still like to comment on your rationales, as I am afraid that we are departing from the initial design goals we agreed on for SeqAn3. Please take my comments as additional feedback notes and not as personal criticism. I merely try to follow common practices of good software design suggested by software professionals. In summary, I proposed two ways how you can achieve the same thing, i.e. efficient expansion of the underlying concat buffer at the right end, with a proper interface and a clean abstraction (the key word is separation of concerns). If you argue that a user has to understand the concepts of a proxy please be aware that the user only works with a "reference type" irrespective of how it is implemented. A proxy is just a design pattern to achieve the reference semantics where Alternatively, I suggested to make the concat function itself part of a concept by offering a concat CPO. This allows anyone with special needs to overload certain behavior of the underlying concat buffer without changing the base class and is therefore a good tool if you want to guarantee certain API stability. If you argue against this because a user has to understand CPOs please be aware that the entire implementation logic behind concepts, used for example for our alphabets, the views etc. are based on it. You argue about more code to be tested but have to test and document all added interfaces anyhow. In fact you have to add unnecessary documentation to explain why there are interfaces which deviate from everything the user knows just to hope he is not using it wrongly. Based on the literature about software design, I promise you; he will! As a final remark, I suggest you remove the |
Hi @rrahn, thanks for your feedback. We all agreed that a proper proxy type would be the cleaner design but none of us has the resources to implement it (even if it would not be a full fledged proxy). We cannot remove the |
The proxy was already there! It was just removed by this PR but could have been extended instead.
That's the whole point of removing it. You can't rely on the proper behavior in generic code anymore. Consequently, the iterator does not work in all cases where the user wants the value type. Because it is not what it is supposed to be. If this is wanted an extra level of abstraction is needed somewhere. |
This PR simplifies and improves a couple of things:
span<>
if the inner_type is a vector which is something you really want.