Update concatenated seqs #2947

Merged: 2 commits merged into seqan:master from h-2:update_concatenated_seqs on Mar 13, 2022
Conversation

@h-2 (Member) commented on Feb 25, 2022

This PR simplifies and improves a couple of things:

  • change the value_type to be the same as the reference type.
  • change the reference type to be whatever slicing the concatenation returns (not an inherited proxy), so you get a std::span<> if the inner_type is a vector, which is what you really want.
  • remove lots of cruft that is no longer necessary
  • add functions for efficiently growing the container by appending to the end – IMHO this is crucial functionality of this data structure! (A standalone sketch of the idea follows below.)
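
For illustration only: a standalone sketch in terms of std types (not the actual SeqAn3 implementation, member names are illustrative). The layout is a flat buffer plus delimiting offsets, slicing yields a std::span, and the last element can be grown in place.

#include <cstddef>
#include <span>
#include <string>
#include <string_view>
#include <vector>

// Flat character buffer plus delimiting offsets; element i is the half-open
// slice [offsets[i], offsets[i + 1]) of `data`.
struct concat_storage
{
    std::string data;                    // all characters stored back to back
    std::vector<std::size_t> offsets{0}; // always holds size() + 1 entries

    std::size_t size() const { return offsets.size() - 1; }

    std::span<char const> operator[](std::size_t const i) const // "slice" reference
    {
        return std::span<char const>{data.data() + offsets[i], offsets[i + 1] - offsets[i]};
    }

    void push_back() // start a new, empty inner sequence
    {
        offsets.push_back(data.size());
    }

    void last_append(std::string_view const s) // grow the last inner sequence in place
    {
        data += s;
        offsets.back() = data.size();
    }
};

Only the end of the flat buffer is ever touched when growing the last element, which is what makes amortised-constant appending possible; growing or shrinking any other element would require moving everything behind it.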

@h-2 requested review from a team and feldroop and removed request for a team on February 25, 2022 16:47
@codecov (bot) commented on Feb 25, 2022

Codecov Report

Merging #2947 (9fb8618) into master (83336fa) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2947   +/-   ##
=======================================
  Coverage   98.28%   98.28%           
=======================================
  Files         267      267           
  Lines       11466    11474    +8     
=======================================
+ Hits        11269    11277    +8     
  Misses        197      197           
Impacted Files Coverage Δ
...qan3/alphabet/container/concatenated_sequences.hpp 97.30% <100.00%> (+0.07%) ⬆️
include/seqan3/argument_parser/argument_parser.hpp 97.59% <0.00%> (+0.02%) ⬆️


@rrahn (Contributor) commented on Feb 26, 2022

Honestly, I have some difficulty understanding your proposed changes regarding the semantics of this data structure. I have to admit that I am not familiar with how it was used before, so I might be missing some valuable information.

So far, I see it as a dynamic container data structure specifically used to store sequences in one contiguous chunk of memory; a container-of-containers. It presumably improves memory efficiency by avoiding fragmented allocations, as opposed to a regular solution using vector<vector<t>>, correct?

However, I think that pmr::vector<pmr::vector<t>> with a monotonic buffer resource shared among the inner vectors would do the same, as long as elements within the outer container are not removed/inserted/modified, since such operations would not reuse the available memory but allocate the new elements at the end of the memory resource.
However, if that is the intended use case, I think a pool_resource would be better suited anyhow.
But I didn't experiment with it a lot; I only saw some nice C++ Weekly episodes a couple of months ago, so my memory of it might be a little sketchy.
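
For concreteness, the pmr setup described above could look roughly like this (illustrative sketch only):

#include <memory_resource>
#include <vector>

int main()
{
    // One arena shared by the outer vector and all inner vectors. A monotonic
    // resource never reuses freed memory; everything is released when the arena dies.
    std::pmr::monotonic_buffer_resource arena{1 << 20};

    std::pmr::vector<std::pmr::vector<char>> seqs{&arena};

    seqs.emplace_back();        // the inner vector picks up the arena via uses-allocator construction
    seqs.back().push_back('A'); // allocations land at the end of the arena
}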

Assuming that what I remember is true, the only advantage of our container would be if it competes with this regular STL solution performance-wise and improves usability, because people might not be used to the pmr machinery; e.g. a memory resource can neither be copied nor moved and needs to live as long as the allocated memory is used by any part of the program.

But these changes don't look more usable to me. A push_back() function that only prepares the outer container for inserting more elements afterwards? And then push_back_back(v) to add an element to the last inner sequence? And all of this is needed because you want the container to remain dynamic but return a view instead of a proxy? At least that is how I understand the changes at the moment. Maybe you can clarify this for me?

In addition, I can no longer take the value type and store it somewhere else without taking care to keep the original concatenated container alive, as it would otherwise result in a dangling reference. Understanding this as a container, I would be very surprised by this kind of behavior. And even more so, it is then just like an alias of a pmr resource, is it not?

So using a proxy as the reference type nicely encapsulates the intended behavior of acting like a vector<vector<t>>, which is great because in a generic context I can easily replace this with the naive solution without being forced to adapt the generic code. (Note I am not specifically referring to generic library code, but more to when I develop my own code, where I prototype with simple solutions that I later replace with better structures to evaluate the improvements.)

So what I don't get is why I can't wrap the returned proxy in a type_reduce view, which is explicitly intended for these kinds of use cases, especially since the proxy could internally just wrap a std::span over the respective memory area. I am pretty optimistic that the compiler can optimize this, so it would be essentially the same as your intended change. But since I am then explicitly modifying the reference type in my user code, it is totally fine if I can't assign it to the value type of the concatenated container anymore. So it leaves the semantic separation of value type and reference type intact.

@h-2 (Member, author) commented on Feb 28, 2022

OK, so I think we have three questions here:

  1. Should the value_type be a container?
  2. Do we need an API to append to the last element?
  3. Do we even need this data structure?

The first two questions were raised by my PR, the third one was raised by you.

Ad 1:
I don't see why we need a container as value_type. No parts of the API return or take the value_type; it is irrelevant to the immediate functionality of the data structure. What is the value_type's role in general? To allow temporary storage, e.g. during iteration (this happens in certain standard library algorithms, I am told):

std::ranges::range_value_t<rng_t> tmp;
for (std::ranges::range_value_t<rng_t> elem : range)
{
    if (foo)
    {
        frobnicate(elem);
        tmp = elem;
    }
    else
    {
        frobnicate(tmp);
    }
}

Since our reference type does not include & and it does not become invalid during iteration, our reference type satisfies the requirements for also being the value_type.
Why does this matter; why not just leave it as a container? Because it introduces (unintentional) copying in the above example and, importantly, it prevents us from using a simple type as the reference type, because the value_type must be constructible from the reference type and e.g. std::vector is not constructible from std::span (at least not in C++20).
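
A minimal standalone example of the constructibility point (C++20; C++23 adds std::ranges::to for this):

#include <span>
#include <vector>

int main()
{
    std::vector<char> storage{'a', 'c', 'g', 't'};
    std::span<char const> s{storage};

    // std::vector<char> v = s;               // ill-formed: no such constructor in C++20
    std::vector<char> v(s.begin(), s.end());  // an explicit copy is required instead
}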
All the talk of "proxies" in the current file is a little misleading, by the way: they do not enable "assigning through", they just add "conversion to the value_type". IMHO this needlessly complicates the API. If we can return std::span or std::string_view, we should do so.

I don't know if anyone else uses this data structure at the moment, but I expect impact to be minimal as explicit use of the value_type is rare and even use-as-shown-above would still be valid (even faster). The interfaces are still experimental anyway.

Ad 2:
The core function of the data structure is to encapsulate a contiguous range (or something similar with contiguous layout, like bitpacked_sequence). Importantly, these data structures support amortised constant growth. They are typically not created out of thin air, but as the result of a computation or through file input.
For example:

auto magic_input = /**/;

concatenated_sequences<std::string> output;

for (/**/)
{
    std::string buffer;
    std::ranges::copy(first_subrange_of_magic_input, std::back_inserter(buffer));
    buffer += '|';
    std::ranges::copy(second_subrange_of_magic_input, std::back_inserter(buffer));
    buffer += '|';
    std::ranges::copy(third_subrange_of_magic_input, std::back_inserter(buffer));

    output.push_back(buffer);
}

Now the problem is that all the data is needlessly copied to the buffer and then into the concatenated_sequences. Moving does not help here, because concatenated_sequences cannot re-use the memory. This is an important limitation compared with a std::vector<std::vector> but it is unavoidable in this design.
My proposed API provides a nice "workaround":

auto magic_input = /**/;

concatenated_sequences<std::string> output;

for (/**/)
{
    output.push_back();
    output.append_back(first_subrange_of_magic_input);
    output.push_back_back('|');
    output.append_back(second_subrange_of_magic_input);
    output.push_back_back('|');
    output.append_back(third_subrange_of_magic_input);
}

This avoids the unnecessary copy operations.

These changes are non-breaking because they only add members. If people don't like the names, we can change them.

Ad 3:
The data structure you propose has some of the capabilities of concatenated_sequences but not all of them:

  • it is not copyable/movable
  • it does not provide efficient access to the concatenation

I need both of these things. I am also not sure how it handles dynamic growth, but that might be solvable. So while pmr and pool-allocators certainly have use-cases, I don't think they fit for me here.

@feldroop (Member) left a comment

I do not yet have an opinion about whether these changes should be implemented. I think I would need some more experience in designing container APIs. But I am interested in following the discussion.

I only added some thoughts about the changes in case we want to merge them.

@@ -200,26 +97,26 @@ class concatenated_sequences
* \{
*/

- /*!\brief == inner_type.
+ /*!\brief An views::slice that represents the range on the concatenated vector.

Inline comment (Member):

A views::slice

* \experimentalapi{Experimental since version 3.1.}
*/
template <std::ranges::forward_range rng_type>
void append_back(rng_type && value)

Inline comment (Member):

Since there is an inner push_back_back(...) that corresponds to the outer push_back(...), maybe it would be nice to have an outer version of this function, e.g. append(...). This could prevent confusion, because append_back(...) sounds similar to push_back(...), but is actually an inner function like push_back_back().

This technically already exists with some more flexibility in the form of insert(...). So maybe it would be nice to have a directly corresponding set of insert_back() overloads. However, this would likely not be ergonomic, due to the iterator-based API of insert(...) and the fact that we return a view as outer elements.

So I guess we stick with this ^^

Inline reply (Contributor):

This is exactly my problem at the moment. The proposed changes do not fit well into the current design of this data structure, irrespective of the naming, nor do they fit any known interface of similar structures from the STL. I am not saying that this is necessarily a bad thing. But if the proposed changes are important in order to use it efficiently, then the semantic meaning of this structure should be reconsidered. Or maybe a different structure is needed.

@rrahn (Contributor) commented on Mar 1, 2022

General design concerns

I think there is a little misunderstanding. I didn't propose any data structure. I wanted to make sure I understand all of the implications of the suggested changes in order to add constructive ideas to the whole thing. For this, I needed to know the closest semantic equivalent among compliant STL solutions, which, you agree, is a vector<vector<...>>, isn't it?
It seems the data structure was initially implemented with this design in mind. The fact that all inner ranges are allocated in one big chunk of memory is not necessarily part of the API (at the moment it is, e.g. via concat_reserve, which I think is wrong) but may be faster than a vector<vector<...>>.
Instead, the data structure already has a concat function that returns some handle to the concatenated memory.
This is awesome, because it is a customisation point. So with a proper concat CPO we could now offer (a sketch follows the list below):

  1. a default implementation returning a join view
  2. return a std::span, as it is right now, for the special concat sequence
  • easily decorate the current data structure to return something different for the concatenated sequence, e.g. a resizeable container, or a last_element_buffer_expander, which allows modifying the size of the last element only.
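
A minimal sketch of what such a concat CPO could look like (hypothetical, illustrative names only, not existing SeqAn3 API): prefer a member concat() if the range provides one, otherwise fall back to a lazily joined view.

#include <ranges>
#include <utility>

// Hypothetical CPO sketch: dispatch to a member concat() if available,
// otherwise produce the default lazy concatenation via std::views::join.
inline constexpr auto concat = []<std::ranges::viewable_range rng_t>(rng_t && rng)
{
    if constexpr (requires { std::forward<rng_t>(rng).concat(); })
        return std::forward<rng_t>(rng).concat();          // e.g. a std::span over the flat buffer
    else
        return std::views::join(std::forward<rng_t>(rng)); // default: lazily joined nested range
};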

I am asking because, alternatively, the concat sequence could also be seen as a special delimited string where I can add a special delimiter symbol to mark the start of a new subrange. In that case, push_back/append just adds to the end of the buffer, meaning to the last subrange in the delimited string.
If needed, one can convert the delimited string to a nested range<range<...>> view via some special API. Note that I use the term string because append is not part of the general container API.

These are two completely valid scenarios and probably can be implemented with the same internal data structures but they are semantically very different and therefore their interfaces must not be mixed!

Inline answers

I don't see why we need a container as value_type. No parts of the API return or take the value_type, it is irrelevant to the immediate functionality of the data structure

That's not correct: push_back, insert, assign, etc. all take the value_type of the container, like any other STL container. It is just wrong in this design to assume that is_compatible_with_value_type<value_type_t> is something different, because you only check whether the value type of the inner type (which is the value type of the concat sequence) is assignable to the reference type of the inner type. That might be a nice extension, but seeing it as a container I expect the interface to be push_back(value_type); otherwise this data structure is not STL-like, and that is one of our major design goals, is it not?
Otherwise, the concat sequence should not be templatised over the value type, i.e. inner range, but over the alphabet type of the inner range and the container/allocator of the internal concat storage.

Since our reference type does not include & and it does not become invalid during iteration, our reference type satisfies the requirements for also being the value_type.

The returned type has reference semantics. A view is nothing but a reference to some memory owned by some other object.
A reference never becomes invalid within the scope of the iteration, unless some bad code tinkers with the originating object while iterating over it. Or maybe I don't understand exactly what you mean by this.

Why does this matter, why not just leave it as a container? Because it introduces (unintentional) copying in above example, and importantly, it prevents us from using a simple type as the reference type, because the value_type must be constructible from the reference type and e.g. std::vector is not constructible from std::span (at least not in C++20).

Again, making value_type = reference_type opens the door to (unintentional) dangling references, which brings us directly into the land of UB.
In parallel code it might even be necessary to use the value type at some point, which would make the whole thing even more dangerous.
And this seems more likely if I use the container the way I am supposed to use a container, rather than using a non-idiomatic C++ for loop.
Unless intended otherwise, the correct loop would look more like one of these three versions:

for (std::ranges::range_reference_t<rng_t> elem : range) { ... }
for (auto && elem : range) { ... }
for (std::ranges::forward_range auto && elem : range) { ... }

Either people know C++ and know the difference between reference semantics and value semantics, or they don't, because they are still new to C++ and have to learn this over time. That is how it is; no one can change it, and it is not a criticism. It is just part of the language.
Using some kind of performance profiling would also make it possible to spot this issue.
Finding weird errors caused by UB in a complex, possibly non-deterministic program will be a total nightmare even for experienced programmers.

All the talk of "proxies" in the current file is a little misleading btw.; they do not enable "assigning through"; they just add "conversion to the value_type". IMHO this needlessly complicates the API. If we can return std::span or string_view, we should do so.

Proxies are never easy. But that does not mean it is ok to weaken other guarantees.
It seems the proxy in this case is not doing what it should do, which would be the actual cause of your problem.
And it is the proxy, a.k.a. the reference type, that has to implement the implicit conversion to the corresponding value type, not the value type itself.
Because, implemented correctly, your use case should be expressible like this (also, I hardly believe the overall performance will suffer when using an additional buffer as in your first example, as long as the intermediate buffer is not huge, because in most cases I would expect memcpy to kick in):

auto magic_input = /**/;

concatenated_sequences<std::string> output;

for (/**/)
{
    output.resize(output.size() + 1);
    auto && last_buffer = output.back();
    last_buffer.reserve(1024); // possibly reserve some additional memory
    std::ranges::copy(first_subrange_of_magic_input, std::back_inserter(last_buffer));
    last_buffer.push_back('|');
    std::ranges::copy(second_subrange_of_magic_input, std::back_inserter(last_buffer));
    last_buffer.push_back('|');
    std::ranges::copy(third_subrange_of_magic_input, std::back_inserter(last_buffer));
    /// ...
}

Doing so for the last element will be as efficient as it gets for expanding memory to the end of a container.
Doing this for an internal element requires some extra handling of the operations to correctly propagate the memory changes to the internal data structures. And yes, that might then not be as efficient as having a simple vector<vector<>>.

The data structure you propose has some of the capabilities of concatenated_sequences but not all of them:

Well, there is a misunderstanding. The memory resource itself should not be owned by the concat sequence; I think that is not how the pmr allocator machinery should be used. Rather, the concat sequence, a.k.a. vector<vector<>>, is initialised with the address of the memory resource, and that resource has to live as long as the objects using its memory.
That's why I think that there is a risk that many people might not be familiar with this and an explicit concat sequence class makes it easier.

My conclusion

So I am in favour of the concat sequence. Don't get me wrong on this.
Moreover, I am totally fine with not offering a full-fledged container-like API for the concat sequence, respectively its proxy, but only what seems sensible for most of the relevant use cases. I think insert, for example, does not.
But if it resembles a container of the form vector<vector<...>> it must be usable as such.
So the proxy type must be fixed!

Otherwise, we are again starting to be not STL-like. As such, the interfaces become harder to use as well as harder to maintain, as everyone involved has to understand why it is not possible to use the container as one is supposed to.
An alternative, or rather a desirable feature, would be, as suggested above, to customise certain behaviour by offering a proper CPO for the concat interface, which I believe could also handle your use case well.

@h-2 (Member, author) commented on Mar 1, 2022

Thank you for taking the time to illustrate your view on this. As much as I would like to discuss all these points in detail, I simply do not have the time or energy to do so. I am already spending more time arguing for this than I spent writing it 😿

I will briefly summarise my arguments and will leave all further changes, discussions and decisions up to the team.

  1. value_type, reference_type

Most users are not aware of either of these. They call range[i] or for (auto && value : range) /**/ ("idiomatic C++"). What they get in both cases is the reference type, so the priority should be to get the reference type right. My proposal makes sure that the reference is std::span or std::string_view for the most relevant specialisations of the class. Currently, it is an undocumented (for the user) type that inherits from one view and relies on another view (as_const) that exists only for this purpose (and is also private). In its current implementation it even has a negative impact on performance, because it destroys the contiguity of the range! The proposed changes, on the other hand, mean using std:: types more prominently and reducing complexity as well as lines of code.

To be able to make this change, we need to change the value_type as well. It looks like the value_type is an important part of the design, but actually it is not; in fact, for the given data structure, none of my code ever uses it. You only use the value_type if you explicitly decide to use the value_type, and you know what you are doing, in which case you can also copy the elements if you need to.
Yes, with my proposed change you can create a dangling value_type. But in all possible scenarios we are already creating a copyable but potentially dangling reference type, which is much more severe because it affects all general use cases.

  1. "growing the last element in place"

You have repeatedly said that library designers should not create designs out of thin air, but based on use cases and feedback from downstream developers. I, being such a downstream developer and user of the given data structure, have identified a problem in the data structure that prevents me from using it efficiently. I have also created a solution to the problem by adding several member functions, each with an implementation of at most two lines, which is very small compared to all the append() and insert() functions this class already has.
While it certainly is possible to develop abstractions that encompass other (containers of) containers, CPOs, custom proxy types... that is just not what I require in my code, and at this point I think it would likely make the data structure more difficult for me to use, not less. In any case, it certainly requires much more time to develop and discuss, which I see no point in doing unless there is a use case.

  1. "do we need this data structure", "PMR"

SeqAn3 has had this data structure for five years already, and previous SeqAn versions and applications made heavy use of the predecessor. So I am not quite sure why the "why" needs to be discussed right now. If you do decide that you want to re-discuss the entire data structure, here are my current requirements:

  • I don't need all the fancy constructors, insert and assign functions
  • I don't believe the seqan3::container concepts make any sense other than to use them internally to check whether something is "like std::vector" [see my dissertation on container concepts]
  • I don't think it should necessarily be called a container-of-container
  • I do need the capability to directly access the concatenation, to reserve the underlying memory and to efficiently append (see the usage sketch after this list)
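
For reference, a usage sketch of these capabilities in terms of the members discussed in this thread (signatures assumed for illustration, not verified against the final API):

#include <seqan3/alphabet/container/concatenated_sequences.hpp>

#include <string>
#include <string_view>

int main()
{
    using namespace std::string_view_literals;

    seqan3::concatenated_sequences<std::string> seqs{};

    seqs.concat_reserve(1'000'000); // reserve the underlying flat memory up front
    seqs.push_back();               // start a new, empty sequence (proposed in this PR)
    seqs.append_back("ACGT"sv);     // grow it in place (proposed in this PR)

    auto flat = seqs.concat();      // direct access to the concatenation
    (void) flat;
}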

That having been said, since SeqAn3 already has this class, fully written and tested and interchangeable with std::vector<std::vector>, I also don't see why it should be removed or trimmed down. It can just be a superset of std::vector<std::vector>.

@rrahn (Contributor) commented on Mar 1, 2022

As much as I would like to discuss all these points in detail, I simply do not have the time or energy to do so.

I feel you brother 😎

do we need this data structure

Just to clarify, there is not a single sentence where I am arguing against this data structure.
In fact I am saying:

... That's why I think that there is a risk that many people might not be familiar with this [pmr stuff] and an explicit concat sequence class makes it easier. ... So I am in favour of the concat sequence. Don't get me wrong on this.

So please don't misquote me! Thank you 🙏

@smehringer (Member) commented on Mar 9, 2022

Core Meeting 09.03.2022

value_type == reference_type?

We will adopt value_type == reference_type.

Reasoning (pro outweighs contra):

  • Contra: If the user explicitly casts to value_type, which is equal to the reference_type, we can end up with a dangling reference.
  • Pro: Simplifies user interface, lowers maintenance.

You usually do not cast to the value_type. Instead, you use the container API. Our API only returns the reference.

Currently, this is already producing a dangling reference, although it does not with std::vector<std::string> (https://godbolt.org/z/4ePnoW6hv):

// Always produces "dangling reference" / A view of the destructed heap of concatenated_sequences
auto foo ()
{
    seqan3::concatenated_sequences<std::string> con = /*...*/;
    return con[2];
}

Before this PR, the following was valid, but would now produce a dangling reference:

// Before: Implicit conversion proxy -> value_type
// After: "Dangling reference" / A view of the destructed heap of concatenated_sequences
seqan3::concatenated_sequences<std::string>::value_type foo ()
{
    seqan3::concatenated_sequences<std::string> con = /*...*/;
    return con[2];
}

We choose consistent UB 🐞
We will document that accessing concatenated_sequences will never return a copy and if you want a copy, you need to do so explicitly.
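
An explicit copy would then look roughly like this (sketch; assumes the post-PR reference type is a view such as std::span / std::string_view):

seqan3::concatenated_sequences<std::string> con = /*...*/;

auto view = con[2];                         // non-owning view into con's buffer
std::string copy(view.begin(), view.end()); // explicit, owning copy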

API

We decided to accept @h-2's proposal with different naming.
Reasoning:

  • Low workload: The solutions of @rrahn to introduce a proper proxy or a CPO would need more work in coding, testing, and documenting. @h-2's proposal only requires some more documentation work.
  • Low complexity: Since we can drop the proxy type, the user does not need to understand the concept of a proxy, nor what a CPO is. They only need to read up on a new member function. This PR only introduces a few lines of code.
  • Implements the use case: Implementing a proper proxy would imply implementing an insert within the concatenated sequence, which we do not have a use case for.
  • Design: @rrahn is right that we deviate from the STL and that the extra API will prohibit exchangeability with std::vector<std::vector<...>>. But if the user wants the advantages of the concatenated_sequences functionality, he will likely make a conscious choice and not change it to std::vector later. And if he really wants to, he can still go the inefficient way of copying into a buffer first.
  • Increased efficiency/efficacy: The proposed changes increase the efficiency of adding characters/a range of characters to the end of the sequence. The push_back() (empty parameter list) is slightly more efficient than a resize and intuitive enough to not add API-complexity.

Naming

We found the current naming perplexing.

We discussed these names:

  • deep_push_back/deep_append
  • inner_push_back/inner_append
  • last_push_back/last_append
  • back_push_back/back_append
  • push_back_back/append_back

We could not find analogous naming in any STL container or Python equivalents. There is span.last<3> which gives you the last 3 elements.

We decided on last_push_back/last_append.

concatenated_sequences<char> output;

for (/**/)
{
    output.push_back();
    output.last_append(first_subrange_of_magic_input);
    output.last_push_back('|');
    output.last_append(second_subrange_of_magic_input);
    output.last_push_back('|');
    output.last_append(third_subrange_of_magic_input);
}

Follow-up decisions

Template parameter

This implies that the former value type is not needed. Therefore, the template argument should be the inner value_type.

seqan3::concatenated_sequences<char> con;

Edit: After #2947 (comment), we decided to keep it as is. However, we want to add some documentation to show the use of the underlying_container_t, e.g. using bitpacked_vector vs. using std::vector.

Benchmark

We need a benchmark. It is odd that we do not have any benchmark for concatenated_sequences that shows the general use of the data structure.

Summary

  • @h-2 Could you rename your functions accordingly? Then we can merge this PR after our review process to check for typos and such.
  • Add documentation/example for underlying_container_t.
  • We must add detailed documentation to warn about dangling references and recommend manual copying.
  • We should add a benchmark (including comparison vs. seqan2).

@h-2 (Member, author) commented on Mar 9, 2022

I will do the renaming as suggested and update the PR.

This implies that the former "value type" is not needed and therefore the template argument should be the inner value_type

Changing this means that the underlying container will always be a vector and cannot be changed anymore. That means no bitpacked_vector and no custom allocators. I currently don't need either, but before changing this, you might want to re-evaluate these use cases if you have not done so yet.

What I will include in this PR (if that's OK), is renaming the template parameter from inner_type to underlying_container_type and clearly document that this is not "the value type".

@eseiler (Member) commented on Mar 9, 2022

Changing this means that the underlying container will always be a vector and cannot be changed anymore. That means no bitpacked_vector and no custom allocators. I currently don't need either, but before changing this, you might want to re-evaluate these use cases if you have not done so yet.

This is a good point. We only considered the user interface. Without the proxy, you cannot return the container in any way, i.e., concatenated_sequences<std::string> cannot return a std::string.
We overlooked that in

std::decay_t<inner_type> data_values;

it still matters how the sequence is stored internally.
This probably also shows that it's not clear what the template parameter actually influences — from the documentation, it's not immediately clear to me that this is the actual type that is internally used to store the sequences.

We will need to decide whether we want:

  • To keep it as is
  • Always use std::vector
  • Have another template parameter to specify the container, and default to std::vector
  • Have some kind of traits
  • Some other solution I didn't immediately think of :)

What I will include in this PR (if that's OK), is renaming the template parameter from inner_type to underlying_container_type and clearly document that this is not "the value type".

I don't see why not 👍

@h-2 force-pushed the update_concatenated_seqs branch from 98f0f44 to 66e58f7 on March 9, 2022 16:15
@h-2 (Member, author) commented on Mar 9, 2022

from the documentation, it's not immediately clear to me that this is the actual type that is internally used to store the sequences.

I updated the documentation to reflect this. What do you think?

We will need to decide whether we want:

Right now, the two template parameters correspond to the two member variables. I think this is as simple as it gets, and being transparent about it helps users understand how the data structure works and what the implications are.

edit: But that's just my opinion, and I don't have strong feelings about it, as long as the functionality that I need is present :)

@smehringer requested review from a team and eseiler and removed request for a team on March 10, 2022 09:49
@eseiler merged commit 9ec41b7 into seqan:master on Mar 13, 2022
@rrahn (Contributor) commented on Mar 15, 2022

@seqan/core I wasn't at the core meeting, so I won't refute your decisions. I would still like to comment on your rationale, as I am afraid that we are departing from the initial design goals we agreed on for SeqAn3. Please take my comments as additional feedback and not as personal criticism. I am merely trying to follow common practices of good software design suggested by software professionals.

In summary, I proposed two ways to achieve the same thing, i.e. efficient expansion of the underlying concat buffer at its right end, with a proper interface and a clean abstraction (the keyword is separation of concerns).
You decided for the allegedly less-work-for-the-developer version by adding irregular interfaces (e.g. push_back() when you can already use resize(size() + 1) or push_back(std::views::empty<...>), can you not?). This will become problematic in the future. First, it makes the data structure harder to use correctly and easier to use incorrectly (I recommend reading Scott Meyers' "Effective C++"). Second, developing other variants of the concatenated sequence with different behavior will become more complex, requiring more code, more tests and more documentation, and will likely introduce differences in the APIs of what is presumably the same thing with different behavior in special cases; it is simply not possible to anticipate all use cases that will come. That's why there are well-defined design patterns for flexibly adapting the software later when needed (I recommend the "Gang of Four" design patterns as a starting point on this topic).
As a consequence, the current changes deviate from the original SeqAn3 design goals we agreed on.
One central aspect was to stay close to the STL, with the argument that it is better to have a familiar API even if the performance might be lower for some edge cases. And again, from experience, I highly doubt that a proper proxy type that allows reserve/resize of the underlying ranges would be less efficient than the current solution.

If you argue that a user has to understand the concept of a proxy, please be aware that the user only works with a "reference type", irrespective of how it is implemented. A proxy is just a design pattern to achieve reference semantics where & can't be used. Doing so would allow a first version with a static representation of the underlying buffer. By using the strategy design pattern, the resize/reserve feature could be added later, for example when a use case like this one comes along. As I stated before, the proxy need not provide all features of the data type it represents; they can be added when needed.

Alternatively, I suggested making the concat function itself part of a concept by offering a concat CPO. This allows anyone with special needs to overload certain behavior of the underlying concat buffer without changing the base class, and is therefore a good tool if you want to guarantee a certain API stability. If you argue against this because a user has to understand CPOs, please be aware that the entire implementation logic behind our concepts, used for example for the alphabets, the views, etc., is based on them. You argue about more code to be tested, but you have to test and document all added interfaces anyhow. In fact, you have to add extra documentation to explain why there are interfaces which deviate from everything the user knows, just to hope he is not using them incorrectly. Based on the literature about software design, I promise you: he will!

As a final remark, I suggest you remove the value_type member entirely, because right now it is meaningless. It is obviously not what it promises to be, namely the type the concatenated sequence was instantiated with.
Since the data structure behaves like a regular container by owning the memory, it would be irritating that the value_type is the reference type.

@smehringer (Member) commented:

Hi @rrahn, thanks for your feedback.

We all agreed that a proper proxy type would be the cleaner design, but none of us has the resources to implement it (even if it would not be a full-fledged proxy).
If anyone finds the time to work on this, we can deprecate the newly introduced functions and replace them with the proxy. The data type is not API-stable anyway.

We cannot remove the value_type member as the random_access_iterator_base needs it.

@rrahn (Contributor) commented on Mar 21, 2022

The proxy was already there! It was just removed by this PR but could have been extended instead.
If this feature is not prioritised, why not postpone it then? But again, @seqan/core made the call; I just wanted to point out that the reasoning has some flaws.

We cannot remove the value_type member as the random_access_iterator_base needs it.

That's the whole point of removing it: you can't rely on its proper behavior in generic code anymore. Consequently, the iterator does not work in all cases where the user wants the value type, because it is not what it is supposed to be. If this is wanted, an extra level of abstraction is needed somewhere.
