Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alternative for seqkit locate that can locate subsequences/motifs #106

Open
abearab opened this issue Mar 7, 2024 · 12 comments
Open

alternative for seqkit locate that can locate subsequences/motifs #106

abearab opened this issue Mar 7, 2024 · 12 comments

Comments

@abearab
Copy link
Contributor

abearab commented Mar 7, 2024

I found seqkit locate very useful but it was surprisingly slow. Would you think there is any biobear approach for that? In general, any seqkit functions are very common and useful – in case you are interested to look deeper :)

@tshauck
Copy link
Member

tshauck commented Mar 7, 2024

It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.

To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to biobear to better support these use cases.

For example, use the seqkit example agctggagctacc...

-- if you wanted to match match against an adapter at the start and allow for two mismatches
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2    |
+------+

You could then add that to a WHERE clause.

Perhaps, I'll add a function that replicates locate, but returns a list of matches... e.g.

SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}

Then more or less have the returned table be aligned with seqkits example? Is that a useful function.

@abearab
Copy link
Contributor Author

abearab commented Mar 9, 2024

Thanks @tshauck for your response. I think that's interesting addition (it's interesting that levenshtein distance is already implemented; I didn't know that! I'll need to ask you another question about that but I'll leave this question clear for now).

For the locate question here, I had some specific thought which is described here – https://github.com/abearab/PosSpacer. I just made this a public repo so you can see my algorithm design notes. However, I gave up on that until I could replace my codes with biobear! The preprocessed NGS reads needs to be counted and the location of spacers in sequencing reads would be nice to be reported.

@tshauck
Copy link
Member

tshauck commented Mar 11, 2024

@abearab, cool, that makes sense to me... I just updated biobear on pypi with the new quality score function as well as an alignment_score function that does basic local alignment between two sequences (e.g. e69df6b). I'll explore more how to return alignment position(s) within a string and follow up when I have some more info.

@abearab
Copy link
Contributor Author

abearab commented Mar 12, 2024

@tshauck I have issue installing it

(this might be a separate issue, I just created a new VM and fresh OS, so ...)

  × Building wheel for biobear (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [168 lines of output]
      Running `maturin pep517 build-wheel -i /home/abearab/miniforge3/envs/screenpro2/bin/python --compatibility off`
      🍹 Building a mixed python/rust project
      🔗 Found pyo3 bindings
      🐍 Found CPython 3.9 at /home/abearab/miniforge3/envs/screenpro2/bin/python
      📡 Using build options features from pyproject.toml
         Compiling libc v0.2.153
       ...
      The following warnings were emitted during compilation:
      
      warning: [email protected]: Failed to run: "cc" "--version"
      
      error: failed to run custom build command for `ring v0.17.8`
      
      Caused by:
        process didn't exit successfully: `/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/target/release/build/ring-646b006cd1874eb5/build-script-build` (exit status: 1)
        --- stdout
        cargo:rerun-if-env-changed=RING_PREGENERATE_ASM
        cargo:rustc-env=RING_CORE_PREFIX=ring_core_0_17_8_
        OPT_LEVEL = Some("3")
        TARGET = Some("x86_64-unknown-linux-gnu")
        HOST = Some("x86_64-unknown-linux-gnu")
        cargo:rerun-if-env-changed=CC_x86_64-unknown-linux-gnu
        CC_x86_64-unknown-linux-gnu = None
        cargo:rerun-if-env-changed=CC_x86_64_unknown_linux_gnu
        CC_x86_64_unknown_linux_gnu = None
        cargo:rerun-if-env-changed=HOST_CC
        HOST_CC = None
        cargo:rerun-if-env-changed=CC
        CC = None
        cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
        cargo:warning=Failed to run: "cc" "--version"
        cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
        CRATE_CC_NO_DEFAULTS = None
        DEBUG = Some("false")
        CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
        cargo:rerun-if-env-changed=CFLAGS_x86_64-unknown-linux-gnu
        CFLAGS_x86_64-unknown-linux-gnu = None
        cargo:rerun-if-env-changed=CFLAGS_x86_64_unknown_linux_gnu
        CFLAGS_x86_64_unknown_linux_gnu = None
        cargo:rerun-if-env-changed=HOST_CFLAGS
        HOST_CFLAGS = None
        cargo:rerun-if-env-changed=CFLAGS
        CFLAGS = None
      
        --- stderr
      
      
        error occurred: Failed to find tool. Is `cc` installed?
      
      
      warning: build failed, waiting for other jobs to finish...
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit status: 101": `env -u CARGO PYO3_ENVIRONMENT_SIGNATURE="cpython-3.9-64bit" PYO3_PYTHON="/home/abearab/miniforge3/envs/screenpro2/bin/python" PYTHON_SYS_EXECUTABLE="/home/abearab/miniforge3/envs/screenpro2/bin/python" "cargo" "rustc" "--features" "pyo3/extension-module" "--message-format" "json-render-diagnostics" "--manifest-path" "/tmp/pip-install-r9v2vcz0/biobear_b2d0e519e8c74f2988c3017387959e41/Cargo.toml" "--release" "--lib"`
      Error: command ['maturin', 'pep517', 'build-wheel', '-i', '/home/abearab/miniforge3/envs/screenpro2/bin/python', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for biobear
Failed to build biobear
ERROR: Could not build wheels for biobear, which is required to install pyproject.toml-based projects

@abearab
Copy link
Contributor Author

abearab commented Mar 12, 2024

#106 (comment)

I had to install this in system level:

sudo apt install GCC

@tshauck
Copy link
Member

tshauck commented Mar 12, 2024

Ah, ok, glad you got it. Looks like I have a little work to do dependency-wise.

@abearab
Copy link
Contributor Author

abearab commented Mar 26, 2024

It's a very nice package overall, and to your point would make for a good reference to build a "recipe" list from.

To your question, depending on the end goal, there's a somewhat crude way you could accomplish this in biobear, but it may not fit your needs well enough. This relates back to your comments trimming, in that I'm looking at performant alignment / string functions that I can add to biobear to better support these use cases.

For example, use the seqkit example agctggagctacc...

-- if you wanted to match match against an adapter at the start and allow for two mismatches
❯ SELECT levenshtein(substr('agctggagctacc', 1, 5), 'agcaa') dist;
+------+
| dist |
+------+
| 2    |
+------+

You could then add that to a WHERE clause.

Perhaps, I'll add a function that replicates locate, but returns a list of matches... e.g.

SELECT locate(sequence_column, pattern, options...) -> list of structs {seqid, pattern, strand, start, end, matched_pattern}

Then more or less have the returned table be aligned with seqkits example? Is that a useful function.

Hi @tshauck – I'm getting back to this discussion and I need to use this functionality to process a dataset. I would be more than happy to test your tool in case you have any new features. What do you recommend to start with? I liked your idea to have that locate function as part of biobear! Let me know what you think, thanks.

@tshauck
Copy link
Member

tshauck commented Mar 27, 2024

Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).

For locate after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.

@tshauck
Copy link
Member

tshauck commented Mar 27, 2024

I have something in this branch I hope to finish up tomorrow. It's slightly different than locate as it requires a regex right now, but is relatively close... e.g.

❯ SELECT locate_regex('agctggagctacc', '[a][atcg][c]') AS locate; -- match a, then one of atc or g, then c
+-----------------------------------------------------------------------------------------------------+
| locate                                                                                              |
+-----------------------------------------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}, {start: 11, end: 14, match: acc}] |
+-----------------------------------------------------------------------------------------------------+

❯ SELECT locate_regex('agctggagctacc', 'agc') AS locate; -- match only agc
+-------------------------------------------------------------------+
| locate                                                            |
+-------------------------------------------------------------------+
| [{start: 1, end: 4, match: agc}, {start: 7, end: 10, match: agc}] |
+-------------------------------------------------------------------+

This is similar to the seqkit example: https://bioinf.shenwei.me/seqkit/usage/#locate

I also hope to add a non-regex based one similar to locate, but it's a little trickier... I'll follow up tomorrow when I know more about how hard/easy it is to add locate w/o the regex.

@tshauck
Copy link
Member

tshauck commented Mar 27, 2024

@abearab I updated biobear to support locate_regex as shown above. You can see how it looks through biobear/polars here:

def test_read_fastq_with_qs_to_list():
"""Test quality scores to list."""
session = connect()
fastq_path = DATA / "test.fastq.gz"
df = session.sql(
f"""
SELECT quality_scores_to_list(quality_scores) quality_score_list, locate_regex(sequence, '[AC]AT') locate
FROM fastq_scan('{fastq_path}')
"""
).to_polars()
assert len(df) == 2
assert df.get_column("quality_score_list").to_list()[0][:5] == [0, 6, 6, 9, 7]
assert df.get_column("locate").to_list() == [
[
{"start": 29, "end": 32, "match": "AAT"},
{"start": 36, "end": 39, "match": "AAT"},
{"start": 40, "end": 43, "match": "CAT"},
],
[
{"start": 29, "end": 32, "match": "AAT"},
{"start": 36, "end": 39, "match": "AAT"},
{"start": 40, "end": 43, "match": "CAT"},
],
]
.

Please let me know if you have any feedback on how that works for your task -- thanks!

@abearab
Copy link
Contributor Author

abearab commented Mar 28, 2024

Hey @abearab -- apologies for taking a bit to get onto this. The CRAM scanning feature is done, and I go started on the writing (though need some feedback from another developer).

For locate after the fixes to dependencies to afford faster installs recently, I should be able to get back to this tomorrow. What timeline are you working with? Obviously, I want to get something you could use, but don't want to lead you on if it's not feasible given your requirements.

Hi @tshauck, Thanks for looking into this. There is no time pressure on my end (i.e. I might also not test new features in biobear as the top priority). I'm currently working on CLI development ArcInstitute/ScreenPro2#35 but I would like to use locate / locate_regex function for topics like ArcInstitute/ScreenPro2#39.

@tshauck
Copy link
Member

tshauck commented Mar 28, 2024

Cool, that all sounds good to me, no rush for me. Please just "at" me on this issue if/when you use this for your work if you have any thoughts on improvements and/or questions. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants