Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interface extensions #17

Merged
merged 88 commits into from
May 4, 2017
Merged

Interface extensions #17

merged 88 commits into from
May 4, 2017

Conversation

wasade
Copy link
Member

@wasade wasade commented Apr 10, 2017

Depends on #15. Adds in support for searching over metadata from the command line to find samples. It's a bit janky as you need to restrict your categories to what is being used in the query in the general case of searching over all samples. This is because, otherwise, you'd need to fetch all the metadata for all samples and that is not a good solution.

As an example, it is now possible to do the following:

$ redbiom search metadata --restrict-to AGE_YEARS --where "CAST(AGE_YEARS AS FLOAT) > 40" | redbiom fetch samples --context test --output metadata_search_test.biom

What's happening behind the scenes is that the metadata obtained are pushed into an sqlite database in order to handle the specific relational query. The backend for this is a copypasta of QIIME2's Metadata object.

Copy link
Collaborator

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good, a few comments/questions. Have this code been deployed in the test environment and tested against the latest qiita public release?

help="The filepath to the sample metadata to load.")
def load_sample_metadata_search(metadata):
"""Load sample metadata."""
# TODO: merge with load_sample_metadata
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really necessary? I'm not sure why. If so, perhaps adding an issue to clarify why is desirable will be good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't really matter. Just felt a little conceptually off

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then remove?

This command will assess, per observation, the number of samples that
observation is found in relative to the metadata category specified.
"""
if threads == 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the implications of this and are we aware of a fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure yet. It could be my local system because I've used joblib with Session a bunch in other contexts, including within redbiom. This is not an issue with concurrent processes (ie totally independent executions), but with some shared state between forked processes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps retesting in the test environment will shine some light into the implications ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary. We need to support thread-local sessions. This does not appear to be an issue on the load into sun-14 likely from the order of operations (ie luck). I'm going to remove this support here for now until thread-local sessions can be added.

help="The WHERE clause to apply")
def search_metadata(restrict_to, where):
"""Find samples by metadata"""
# TODO: deprecate
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why deprecate?

@@ -74,17 +80,70 @@ def summarize_metadata(descending):
click.echo("%s\t%s" % (idx, val))


def _summarize_id(context, category, id):
"""Summarize the ID over the category"""
from redbiom.summarize import category_from_observations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the only place where we have: from redbiom.summarize import category_from_observations vs. import redbiom.summarize and then call redbiom.summarize.category_from_observations, why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, can change

import pandas as pd
df = pd.DataFrame(mappings)
df.set_index('feature', inplace=True)
df[df.isnull()] = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this table is returning counts, a zero is more appropriate than a nan

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure, nan means that it wasn't there, 0 means no occurrences, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are integers, not floats, and zero occurrences are appropriate for similar reasons why zeros are appropriate in an OTU table

from redbiom.util import float_or_nan
tokens = list(shlex.shlex(criteria))
if len(tokens) > 1:
# < 5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if these comments are needed ...

import shlex
from redbiom.util import float_or_nan
tokens = list(shlex.shlex(criteria))
if len(tokens) > 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the special case for == 1? shouldn't if operator is None: cover that case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code should be marked for deletion. it is no longer necessary with the improved search queries

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, why not just delete?

@wasade
Copy link
Member Author

wasade commented Apr 29, 2017

Thanks! This PR was exercised in the test environment a few commits back. A critical piece that does need to be exercised is the loading of the new indices. I don't think I'll be able to issue the loads this weekend unfortunately.

@wasade
Copy link
Member Author

wasade commented May 1, 2017

Okay, pushed all the metadata searching onto the new grammars and fixed the thread handling for the table summarization. Tests should pass here. I'm now going to fix the loader for from qiita to bring it up to speed with the new metadata search indices

@wasade
Copy link
Member Author

wasade commented May 3, 2017

redbiom fetch samples is currently handing redbiom IDs but is not handling QIIME compatible IDs yet. That needs to be fixed, do not merge until that commit is in (this morning).

@wasade
Copy link
Member Author

wasade commented May 3, 2017

redbiom select samples-from-metadata was not fully disambiguating IDs leading to a discrepency with redbiom summarize samples.

import requests
import os

pid = os.getpid()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this is the thread/multiprocess work around. What we do is create a session per process ID. These are automatically closed out atexit. This allows for each process to operate without mutating another's session state.

@antgonza antgonza merged commit 37db432 into master May 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants