Sortagrad #13
base: master
Conversation
julianmack
left a comment
A small number of minor refactoring and docstring comments
batch_sampler=SortaGrad(
    indices=range(len(train_dataset)),
    batch_size=task_config.train_config.batch_size,
    shuffle=shuffle,
    drop_last=False,
),
I think this was in deepspeech_internal but could we avoid repetition by changing to:
if task_config.train_config.sortagrad:
    batch_sampler = SortaGrad(...)
    collate_fn = seq_to_seq_collate_fn_sorted
else:
    batch_sampler = SequentialRandomSampler(...)
    collate_fn = seq_to_seq_collate_fn
And then construct the same train_loader once for both cases?
See next comment for seq_to_seq_collate_fn vs seq_to_seq_collate_fn_sorted
I did exactly what you have written, but unfortunately black was complaining during the pre-commit because it thinks the variable batch_sampler should always be a SortaGrad or always a SequentialRandomSampler (it reports a type inconsistency). I couldn't find any solution to this problem other than duplicating the code in the if-else branches.
I think you can define the types in the main body in this case:
from typing import Union

batch_sampler: Union[SortaGrad, SequentialRandomSampler]
if task_config.train_config.sortagrad:
    batch_sampler = SortaGrad(...)
    collate_fn = seq_to_seq_collate_fn_sorted
else:
    batch_sampler = SequentialRandomSampler(...)
    collate_fn = seq_to_seq_collate_fn
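For reference, a minimal self-contained sketch of this annotation pattern (the classes and collate functions below are stand-ins, not the real myrtlespeech APIs):

from typing import Callable, Union

class SortaGrad:  # stand-in for myrtlespeech.data.sampler.SortaGrad
    def __init__(self, indices, batch_size):
        self.indices, self.batch_size = indices, batch_size

class SequentialRandomSampler:  # stand-in sampler
    def __init__(self, indices, batch_size):
        self.indices, self.batch_size = indices, batch_size

def seq_to_seq_collate_fn_sorted(batch): return batch
def seq_to_seq_collate_fn(batch): return batch

use_sortagrad = True  # e.g. task_config.train_config.sortagrad

# Annotating the variables up front lets mypy accept either branch,
# so the DataLoader construction can be shared after the if/else.
batch_sampler: Union[SortaGrad, SequentialRandomSampler]
collate_fn: Callable

if use_sortagrad:
    batch_sampler = SortaGrad(indices=range(64), batch_size=8)
    collate_fn = seq_to_seq_collate_fn_sorted
else:
    batch_sampler = SequentialRandomSampler(indices=range(64), batch_size=8)
    collate_fn = seq_to_seq_collate_fn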
It seems to be working, but I still couldn't avoid repeating the batch sampler initialization.
src/myrtlespeech/data/batch.py
Outdated
# Sort the samples
samples = [
    (input, in_seq_len, target, target_seq_len)
    for input, in_seq_len, target, target_seq_len in zip(
        inputs, in_seq_lens, targets, target_seq_lens
    )
]
Can this function be deleted so that there is just seq_to_seq_collate_fn() with a bool argument for sorting, as I think these lines are the only addition?
Collate functions in PyTorch receive just the batch argument at runtime by default. I've added a commit with a small hack to make it work with two arguments (batch and the bool sort argument): it uses a lambda function when creating the DataLoader. Let me know if this solution looks clearer or not.
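A small sketch of that approach with a hypothetical collate function (functools.partial is shown as an equivalent alternative to the lambda):

from functools import partial

import torch
from torch.utils.data import DataLoader, TensorDataset

def seq_to_seq_collate_fn(batch, sort=False):
    # Hypothetical collate function: the DataLoader only ever passes `batch`,
    # so the extra `sort` flag has to be bound before the loader is created.
    if sort:
        batch = sorted(batch, key=lambda sample: sample[0].size(0), reverse=True)
    return batch

dataset = TensorDataset(torch.randn(16, 10), torch.randint(0, 5, (16,)))

# Bind `sort` with a lambda, as described above ...
loader = DataLoader(
    dataset, batch_size=4,
    collate_fn=lambda batch: seq_to_seq_collate_fn(batch, sort=True),
)
# ... or equivalently with functools.partial.
loader = DataLoader(
    dataset, batch_size=4,
    collate_fn=partial(seq_to_seq_collate_fn, sort=True),
)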
I think it should be possible to deal with the black errors - I have suggested a way that I think works.
edit: (I didn't mean to request more changes here - the same ones are outstanding)
julianmack
left a comment
Nice - all looks good! Great.
shuffle_batches_before_every_epoch: true;
sortagrad: true;
The Deep Speech 2 paper describes SortaGrad as:
Specifically, in the first training epoch we iterate through minibatches in the training set in increasing order of the length of the longest utterance in the minibatch. After the first epoch training reverts back to a random order over minibatches.
Having both a shuffle_batches_before_every_epoch and a sortagrad option is not consistent with this, e.g. what does shuffle_batches_before_every_epoch: false; sortagrad: true mean?
A potential alternative is to have a shuffle_strategy-like field:
oneof shuffle_strategy {
  Unshuffled unshuffled = 1;
  Random random = 2;
  SortaGrad sorta_grad = 3;
}
Unshuffled and Random are not good names but hopefully this gives an idea.
Thoughts on this?
Basically, if sortagrad: true then the shuffle_batches_before_every_epoch flag will be ignored in the first epoch, so I agree that having two separate flags could be a bit confusing for the end user.
package myrtlespeech.protos;

import "google/protobuf/wrappers.proto";
Now an unused import :-)
Removed!
indices=range(len(train_dataset)),
batch_size=task_config.train_config.batch_size,
shuffle=shuffle,
shuffle=True,
Does sorting within a single batch matter? Why does SortaGrad require it and the other two cases not?
What I thought was that it could matter, especially when using big batch sizes. Moreover, I saw that in deepspeech_internal the collate function also sorts every single batch, so I stuck with that implementation.
The value of the sort variable is only used inside the collate function and it is True only when we want to sort every single batch. It is set to False in the sequential_batches case because I interpreted it as "we want to go through every batch in a sequential way but we don't care about sorting single batches".
On the other hand, I set it to True in the sorta_grad case because I interpreted it as "we want to go through every batch in a sequential way for the first epoch and we also care about sorting each single batch".
Let me know if my interpretation was wrong and something should be changed.
What would change when the batch size increases?
The order of samples within a batch has no effect on the training process unless I'm overlooking something - i.e. the loss, gradient, etc. will be the same.
If the above is true, sort should just be fixed to one value for all cases to simplify the logic? True?
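A toy check of that claim (a plain linear model and mean cross-entropy, not the myrtlespeech model or loss): permuting the samples within a batch leaves the loss, and therefore the gradients, unchanged.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)
inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))

perm = torch.randperm(8)  # random within-batch reordering

loss_a = torch.nn.functional.cross_entropy(model(inputs), targets)
loss_b = torch.nn.functional.cross_entropy(model(inputs[perm]), targets[perm])

assert torch.allclose(loss_a, loss_b)  # the mean loss is permutation-invariant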
Makes sense - I misinterpreted the sorting inside the collate function in deepspeech_internal. I have removed the sort variable and changed the collate function accordingly.
What are the init parameters for which I should add documentation?
@@ -29,10 +30,12 @@ message TrainConfig {
Nit: messages defined using CamelCase have field names written using snake_case. This convention makes this line SortaGrad sorta_grad.
Changed it!
I actually thought Sortagrad was a single word instead of two, that's why I didn't use an underscore.
src/myrtlespeech/data/sampler.py
Outdated
random.shuffle(indices)
for index in indices:
    yield self.batch_indices[index]
self._n_iterators += 1
This should be moved before the for loop to match the semantics described in the comment. For instance, consider the following:
indices = list(range(64))
srs = SequentialRandomSampler(
    indices=indices,
    batch_size=8,
    shuffle=True,
    sequential={0},
)
iter_1 = iter(srs)
iter_2 = iter(srs)
print(next(iter_1))
print(next(iter_2))

This will output:

[0, 1, 2, 3, 4, 5, 6, 7]
[0, 1, 2, 3, 4, 5, 6, 7]

when it should be [0, 1, 2, 3, 4, 5, 6, 7] followed by a random batch of indices?
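A minimal, self-contained sketch of an __iter__ that bumps the counter before the loop (a simplified stand-in for the real sampler, with the sequential/shuffle semantics assumed from this thread):

import random
from typing import Iterable, List, Optional, Set

class TinySequentialRandomSampler:
    """Simplified stand-in: yields batches of indices, sequentially for epochs
    whose count is in `sequential`, shuffled otherwise."""

    def __init__(self, indices: Iterable[int], batch_size: int,
                 shuffle: bool, sequential: Optional[Set[int]] = None) -> None:
        idx = list(indices)
        self.batch_indices: List[List[int]] = [
            idx[i:i + batch_size] for i in range(0, len(idx), batch_size)
        ]
        self._shuffle = shuffle
        self._sequential = sequential or set()
        self._n_iterators = 0

    def __iter__(self):
        # Read and bump the counter *before* looping so a second iterator
        # created while the first is still live sees the updated count.
        sequential_pass = self._n_iterators in self._sequential
        self._n_iterators += 1
        order = list(range(len(self.batch_indices)))
        if self._shuffle and not sequential_pass:
            random.shuffle(order)
        for index in order:
            yield self.batch_indices[index]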
Done!
samgd
left a comment
👍 Nearly there!
Can the function/method/class init parameters be typed and documentation added?
samgd
left a comment
Adding and checking the Sphinx documentation, plus updating the deepspeech_internal tests file for myrtlespeech, are the final hurdles 👍
src/myrtlespeech/data/sampler.py
Outdated
indices,
batch_size,
shuffle,
drop_last=False,
n_iterators=0,
sequential=None,
The parameters should be typed and then the types in the docstring can be removed.
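A sketch of what a typed signature could look like (parameter names taken from the diff above; the exact type choices are an assumption, not the final code):

from typing import Iterable, Optional, Set

class SequentialRandomSampler:  # signature sketch only; body elided
    def __init__(
        self,
        indices: Iterable[int],
        batch_size: int,
        shuffle: bool,
        drop_last: bool = False,
        n_iterators: int = 0,
        sequential: Optional[Set[int]] = None,
    ) -> None:
        ...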
self._n_iterators = n_iterators
self._sequential = sequential or {}

def _batch_indices(self, indices, batch_size, drop_last):
Add types.
Types?
src/myrtlespeech/data/sampler.py
Outdated
| """ | ||
|
|
||
| def __init__( | ||
| self, indices, batch_size, shuffle, drop_last=False, start_epoch=0 |
Type the arguments and remove types from docstring.
src/myrtlespeech/data/sampler.py
Outdated
other passes. See Deep Speech 2 paper for more information on this:
https://arxiv.org/abs/1512.02595
reStructuredText has syntax for creating a hyperlink, see here:
`Deep Speech 2 <https://arxiv.org/abs/1512.02595>`_
@@ -0,0 +1,99 @@
from myrtlespeech.data.sampler import SequentialRandomSampler
from myrtlespeech.data.sampler import SortaGrad
These tests are copied over from deepspeech_internal, which is fine, but they should be updated to use Hypothesis.
Mainly, dataset_gen, n_batches, batch_size, full_last_batch, etc. can become parameters of each test that are generated by a search strategy. The benefit is that failing tests then search over a wider range and Hypothesis shrinks to the minimal failing test case. Currently the tests use arbitrary, fixed values for each of the above.
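A sketch of the style being suggested (the constructor arguments and the exact assertions are assumptions about the sampler's behaviour, for illustration only):

from hypothesis import given, strategies as st

from myrtlespeech.data.sampler import SequentialRandomSampler


@given(
    dataset_len=st.integers(min_value=1, max_value=256),
    batch_size=st.integers(min_value=1, max_value=32),
    drop_last=st.booleans(),
)
def test_batches_are_well_formed(dataset_len, batch_size, drop_last):
    sampler = SequentialRandomSampler(
        indices=range(dataset_len),
        batch_size=batch_size,
        shuffle=False,
        drop_last=drop_last,
    )
    batches = list(sampler)
    # Every batch fits the batch size and every index comes from the dataset.
    assert all(len(batch) <= batch_size for batch in batches)
    assert all(0 <= idx < dataset_len for batch in batches for idx in batch)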
src/myrtlespeech/data/sampler.py
Outdated
def __init__(self, indices, batch_size, shuffle, drop_last=False):
The iterator used each time this iterable is iterated over will yield
batches either sequentially (i.e. in-order) or randomly (uniform without
replacement) from `batches`.
Just re-reading the docstring via Sphinx and realised this may be out of date: what is batches? Should this now read "yield batches of indices either sequentially ... without replacement"?
Changed
src/myrtlespeech/data/sampler.py
Outdated
sequential iterator is returned if the current count is in `sequential`.

Args:
    indices: data with which batches are created.
Nit: d -> D?
Done
src/myrtlespeech/data/sampler.py
Outdated
drop_last: Optional[bool] = False,
n_iterators: Optional[int] = 0,
Optional[T] means either type T or None is OK. It is equivalent to Union[T, None] - see the docs.
Both drop_last and n_iterators have concrete values - i.e. they are never None - so they can be bool and int respectively.
Done
src/myrtlespeech/data/sampler.py
Outdated
self.batch_indices = self._batch_indices(
    indices, batch_size, drop_last
)
self._n_iterators: Optional[int] = n_iterators
Mypy should infer the type as int here after the update above.
Done
src/myrtlespeech/data/sampler.py
Outdated
    indices, batch_size, drop_last
)
self._n_iterators: Optional[int] = n_iterators
self._sequential: Union[Set, Dict] = sequential or {}
Adding types actually caught a bug here: {} is a dictionary rather than a set and hence mypy was complaining the type should be Union[Set, Dict].
This should be self._sequential = sequential or set() and mypy should infer the type OK.
Changed
src/myrtlespeech/data/sampler.py
Outdated
def __init__(
    self,
    indices: Union[range, List],
indices is actually an Iterable[int]. This is based on the API Python uses for for loops: https://treyhunner.com/2016/12/python-iterator-protocol-how-for-loops-work/
(or maybe it's Sequence[int], to enforce a known maximum size?)
I changed it to Iterable
src/myrtlespeech/data/sampler.py
Outdated
def __init__(
    self,
    indices: Union[range, List],
See above comment.
Done
tests/data/test_sampler.py
Outdated
sequential = set(
    sorted(random.sample(range(max_sequential), n_sequential))
)
max_sequential should actually be a set generated by Hypothesis rather than an integer that is then used internally to generate a set? This way Hypothesis can control n_sequential and shrink both values down to the minimal possible failing test case.
I have created a separate function that returns a SearchStrategy for the set of sequential epoch numbers, so max_sequential and n_sequential are no longer needed in the test parameters. Let me know if this solution is fine or if I need to change it.
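A sketch of what such a helper might look like (the name and bounds here are assumptions, not the actual test code):

from typing import Set

from hypothesis import strategies as st
from hypothesis.strategies import SearchStrategy


def sequential_epochs(max_epoch: int = 10, max_size: int = 5) -> SearchStrategy[Set[int]]:
    """Strategy generating the set of epoch numbers that should be sequential."""
    return st.sets(
        st.integers(min_value=0, max_value=max_epoch),
        max_size=max_size,
    )

It can then be used directly in a @given(...) decorator, e.g. @given(sequential=sequential_epochs()).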
@@ -0,0 +1,12 @@
============
Nit: make these 1 longer than the title:
=========
sampler
=========
Done
.. autoclass:: myrtlespeech.data.sampler.SequentialRandomSampler
    :members:
    :show-inheritance:

.. autoclass:: myrtlespeech.data.sampler.SortaGrad
    :members:
    :show-inheritance:
Can this be auto-generated to reduce chance of forgetting to update it in the future?
.. automodule:: myrtlespeech.data.sampler
    :members:
    :show-inheritance:
I have added this at the beginning of the file
Added the SortaGrad training strategy plus additional tests. The code has mainly been taken from the deepspeech_internal repo and slightly modified to make it work within myrtlespeech.