Add multidataset #1010

Merged: 22 commits merged on Apr 21, 2023

Conversation

yfyeung (Collaborator) commented on Apr 18, 2023

greedy_search

| test-clean | test-other | sum | config |
|---|---|---|---|
| 1.9 | 4.06 | 5.96 | epoch 30 avg 3 |
| 1.9 | 4.06 | 5.96 | epoch 30 avg 4 |
| 1.91 | 4.06 | 5.97 | epoch 30 avg 7 |
| 1.92 | 4.06 | 5.98 | epoch 30 avg 5 |
| 1.93 | 4.05 | 5.98 | epoch 30 avg 6 |
| 1.91 | 4.08 | 5.99 | epoch 30 avg 2 |
| 1.91 | 4.1 | 6.01 | epoch 30 avg 10 |
| 1.91 | 4.11 | 6.02 | epoch 30 avg 8 |
| 1.91 | 4.11 | 6.02 | epoch 30 avg 13 |
| 1.9 | 4.13 | 6.03 | epoch 30 avg 12 |
| 1.91 | 4.12 | 6.03 | epoch 30 avg 11 |
| 1.9 | 4.14 | 6.04 | epoch 30 avg 1 |
| 1.92 | 4.12 | 6.04 | epoch 30 avg 9 |
| 1.92 | 4.14 | 6.06 | epoch 30 avg 14 |
| 1.95 | 4.2 | 6.15 | epoch 30 avg 15 |
| 1.98 | 4.2 | 6.18 | epoch 30 avg 16 |
| 2.0 | 4.24 | 6.24 | epoch 30 avg 17 |
| 2.02 | 4.3 | 6.32 | epoch 30 avg 18 |
| 2.02 | 4.32 | 6.34 | epoch 30 avg 19 |
| 2.06 | 4.39 | 6.45 | epoch 30 avg 20 |

modified_beam_search

| test-clean | test-other | sum | config |
|---|---|---|---|
| 1.89 | 3.99 | 5.88 | epoch 30 avg 8 |
| 1.9 | 3.99 | 5.89 | epoch 30 avg 7 |
| 1.88 | 4.02 | 5.90 | epoch 30 avg 4 |
| 1.91 | 3.99 | 5.90 | epoch 30 avg 5 |
| 1.9 | 4.0 | 5.90 | epoch 30 avg 6 |
| 1.9 | 4.01 | 5.91 | epoch 30 avg 3 |
| 1.89 | 4.03 | 5.92 | epoch 30 avg 2 |
| 1.89 | 4.03 | 5.92 | epoch 30 avg 9 |
| 1.9 | 4.03 | 5.93 | epoch 30 avg 10 |
| 1.91 | 4.12 | 6.03 | epoch 30 avg 1 |

fast_beam_search

| test-clean | test-other | sum | config |
|---|---|---|---|
| 1.9 | 3.98 | 5.88 | epoch 30 avg 7 |
| 1.9 | 4.01 | 5.91 | epoch 30 avg 6 |
| 1.9 | 4.01 | 5.91 | epoch 30 avg 8 |
| 1.87 | 4.04 | 5.91 | epoch 30 avg 9 |
| 1.92 | 4.0 | 5.92 | epoch 30 avg 5 |
| 1.93 | 4.01 | 5.94 | epoch 30 avg 4 |
| 1.92 | 4.03 | 5.95 | epoch 30 avg 3 |
| 1.9 | 4.06 | 5.96 | epoch 30 avg 10 |
| 1.93 | 4.05 | 5.98 | epoch 30 avg 2 |
| 1.92 | 4.07 | 5.99 | epoch 30 avg 1 |

yfyeung requested a review from csukuangfj on April 21, 2023, 04:35
parser.add_argument(
    "--perturb-speed",
    type=str,
    default=True,
Collaborator

Please use str2bool.

Collaborator Author

ok
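
For context on the request above: with `type=str`, passing `--perturb-speed False` yields the non-empty string `"False"`, which is truthy, so the option could never be switched off from the command line. A minimal sketch of the fix follows. The `str2bool` defined here is a local stand-in written for illustration so that the snippet runs on its own; icefall has its own helper of that name, and `get_parser` is a hypothetical wrapper.

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as "true" or "0" to a bool."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("yes", "true", "t", "y", "1"):
        return True
    if lowered in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {value!r}")


def get_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--perturb-speed",
        type=str2bool,  # the converter goes in `type`, not in `default`
        default=True,
        help="Whether to apply speed perturbation.",
    )
    return parser


args = get_parser().parse_args(["--perturb-speed", "False"])
print(args.perturb_speed)
```

The converter belongs in `type`; `default` stays a plain boolean.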

cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
)
if perturb_speed:
cut_set = (
Collaborator

Please add a log message saying that speed perturbation is being applied.

Collaborator Author

ok
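
One way the requested log line could look, as a sketch only. The function name `maybe_perturb_speed` and the log text are assumptions made for illustration; `cut_set` is assumed to behave like a lhotse `CutSet`, that is, it supports `+` and has a `perturb_speed(factor)` method.

```python
import logging


def maybe_perturb_speed(cut_set, perturb_speed: bool):
    """Return cut_set, optionally extended with 0.9x and 1.1x speed copies.

    `cut_set` is assumed to behave like a lhotse CutSet: it supports `+`
    and provides `perturb_speed(factor)`.
    """
    if perturb_speed:
        # Tell the user that the (3x larger) perturbed set is being built.
        logging.info("Doing speed perturb")
        cut_set = (
            cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
        )
    return cut_set
```

When `perturb_speed` is false, the cut set is returned unchanged and nothing is logged.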


class MultiDataset:
    def __init__(self, manifest_dir: str):
        self.manifest_dir = Path(manifest_dir)
Collaborator

Please document what manifest_dir contains.

Collaborator Author

ok
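
A sketch of what such a docstring might say. The directory layout is inferred only from the glob pattern that appears in this PR; the wording is an assumption, not the text that was eventually committed.

```python
from pathlib import Path


class MultiDataset:
    def __init__(self, manifest_dir: str):
        """
        Args:
          manifest_dir:
            Directory containing the training cut manifests. Judging by the
            glob pattern used in this PR, it is expected to contain files
            matching:

              multidataset_split_1998/multidataset/
                multidataset_cuts_train.*.jsonl.gz
        """
        self.manifest_dir = Path(manifest_dir)
```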

Comment on lines 34 to 38
filenames = list(
    glob.glob(
        f"{self.manifest_dir}/multidataset_split_1998/multidataset/multidataset_cuts_train.*.jsonl.gz"
    )
)
Collaborator

Suggested change
- filenames = list(
-     glob.glob(
-         f"{self.manifest_dir}/multidataset_split_1998/multidataset/multidataset_cuts_train.*.jsonl.gz"
-     )
- )
+ filenames = glob.glob(
+     f"{self.manifest_dir}/multidataset_split_1998/multidataset/multidataset_cuts_train.*.jsonl.gz"
+ )

Collaborator Author

ok

)

pattern = re.compile(r"multidataset_cuts_train.([0-9]+).jsonl.gz")
idx_filenames = [(int(pattern.search(f).group(1)), f) for f in filenames]
Collaborator

Suggested change
- idx_filenames = [(int(pattern.search(f).group(1)), f) for f in filenames]
+ idx_filenames = ((int(pattern.search(f).group(1)), f) for f in filenames)

Collaborator Author

ok

idx_filenames = [(int(pattern.search(f).group(1)), f) for f in filenames]
idx_filenames = sorted(idx_filenames, key=lambda x: x[0])

sorted_filenames = [f[1] for f in idx_filenames]
Collaborator

Suggested change
- sorted_filenames = [f[1] for f in idx_filenames]
+ sorted_filenames = (f[1] for f in idx_filenames)

Collaborator Author

ok
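
Putting the three suggestions together, the file listing could be sketched as below. The function name `sorted_split_filenames` and its `split_dir` parameter are hypothetical. Two details are worth noting: `sorted()` already returns a list, so the intermediate generator is safe, but the final result is kept as a list here because the `logging.info(f"Loading {len(sorted_filenames)} splits")` line in this PR calls `len()` on it, which a generator does not support. The dots in the regular expression are also escaped, which the original pattern does not do.

```python
import glob
import re
from pathlib import Path
from typing import List


def sorted_split_filenames(split_dir: Path) -> List[str]:
    """Return the split manifests in split_dir ordered by numeric index."""
    filenames = glob.glob(f"{split_dir}/multidataset_cuts_train.*.jsonl.gz")

    # Escape the dots so they match literally rather than any character.
    pattern = re.compile(r"multidataset_cuts_train\.([0-9]+)\.jsonl\.gz")

    # A generator is enough here: sorted() consumes it and returns a list.
    idx_filenames = ((int(pattern.search(f).group(1)), f) for f in filenames)
    idx_filenames = sorted(idx_filenames, key=lambda x: x[0])

    # Keep the result a list so that len() still works on it.
    return [f for _, f in idx_filenames]
```

Sorting on the integer index rather than on the file name puts split 2 before split 10, which a plain string sort would not.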

@@ -64,7 +64,7 @@ def get_args():
     parser.add_argument(
         "--perturb-speed",
         type=str,
-        default=True,
+        default=str2bool,
Collaborator

Please refer to multidataset.py for how to use str2bool.

Collaborator Author (yfyeung, Apr 21, 2023)

That was just an accidental slip.

yfyeung merged commit d67a49a into k2-fsa:master on Apr 21, 2023
yfyeung deleted the multi branch on April 21, 2023, 10:09

logging.info(f"Loading {len(sorted_filenames)} splits")

return lhotse.combine(lhotse.load_manifest_lazy(p) for p in sorted_filenames)
Collaborator

Please use lhotse-speech/lhotse#565.

We only need to combine splits from the same dataset.
