Added support for CTC in both Theano and Tensorflow along with image OCR example. #3436
Conversation
Sounds great! I'll review it tomorrow. CTC is definitely a much-needed addition. For the time being, one immediate comment: please do not commit data files into the git tree; rather, put them online and have your script fetch them, like so: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py#L23
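For reference, the fetch-at-runtime pattern referred to above looks roughly like the sketch below. This is a minimal illustration only: the URL is the one the example later adopted (see the diff further down), and get_file caches the download under ~/.keras/datasets.

import os
from keras.utils.data_utils import get_file

# download and unpack the word lists on first use; cached for later runs
fdir = os.path.dirname(get_file('wordlists.tgz',
                                origin='http://www.isosemi.com/datasets/wordlists.tgz',
                                untar=True))
mono_path = os.path.join(fdir, 'wordlist_mono_clean.txt')  # monogram list used by the example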
# for the particular OS in use.
#
# This starts off with easy 5 letter words. After 10 or so epochs, CTC
# learn translataional invariance, so longer words and groups of words
Typo: "translation invariance"
Here's what's in the latest PR:
Thanks! Style-wise: still lots of unused imports. Otherwise LGTM.
Latest commit fixes those unused imports. This has been a learning experience....
One last thing before I merge. Your commits are not associated with your GitHub email address. That means that your account won't be linked to the PR and you won't appear in the list of contributors. You may want to fix that (add your git email to your GitHub account, for instance).
self.X_text = []
self.Y_len = [0] * self.num_words

#monogram file is sorted by frequency in english speech
In-line comments require one space after #
LGTM
Thanks for the great OCR example. Very valuable imo.
We'll probably have to update the way to import CTC in TF. The current code appears to work with TF 0.9 but breaks in 0.10rc.
Yeah, they moved CTC from the experimental contrib area to core.util.ctc. I wasn't sure of the best "pythonic" way of checking multiple locations for an import.
Here's one way to do this:
Unfortunately this breaks symmetry with how
Finally, another way to do this would be to make a Tensorflow compatibility layer elsewhere in the source code, and only access Tensorflow through that compatibility layer.
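As an illustration of the try/except approach discussed above, a minimal sketch might look like the following. The exact module paths are assumptions based on the "contrib to core" move mentioned in this thread, not the snippet that was actually committed.

import tensorflow as tf

try:
    # assumed newer location, after CTC moved out of contrib
    from tensorflow.python.ops import ctc_ops as ctc
except ImportError:
    # assumed older (TF 0.9-era) experimental location
    import tensorflow.contrib.ctc as ctc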
…OCR example. (keras-team#3436)

* Added CTC to Theano and Tensorflow backend along with image OCR example
* Fixed python style issues, made data files remote, and made code more idiomatic to Keras
* Fixed a couple more style issues brought up in the original PR
* Reverted wrappers.py
* Fixed potential training-on-validation issue and removed unused imports
* Fixed PEP8 issue
* Remaining PEP8 issues fixed
@mbhenry
It's on my list to look into both of those. Variable width would probably have to be done with dynamic RNNs. Right now I'm also working on improving convergence stability... slight disturbances seem to have big impacts on convergence.
@mbhenry
Is this loss function going to be documented? I assume it is experimental while it remains undocumented, yet it has been merged. Has it been released at all?
@mbhenry, thanks for a great example. I noticed that you 'skip' the first couple of letters generated in the code. Is there a reason for doing this? Is traversing the image in order important for generating the letters left-to-right? Would it make more sense to have a single image representation (perhaps at the end of an LSTM that has seen all the "slices" of image features) and use RepeatVector to feed this image information to each timestep in the RNN? (e.g., something a simple captioning model would do)
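To make that alternative concrete, a captioning-style layout along the lines the comment describes might look like the sketch below. This is purely illustrative (layer sizes and names are made up) and is not part of the merged example.

from keras.layers import Input, GRU, RepeatVector, TimeDistributed, Dense
from keras.models import Model

time_steps, feat_dim, out_len, num_classes = 128, 32, 16, 28  # illustrative sizes

img_slices = Input(shape=(time_steps, feat_dim))   # CNN feature "slices" of the image
encoding = GRU(64)(img_slices)                     # single vector after seeing all slices
repeated = RepeatVector(out_len)(encoding)         # one copy per output character
decoded = GRU(64, return_sequences=True)(repeated)
chars = TimeDistributed(Dense(num_classes, activation='softmax'))(decoded)

model = Model(img_slices, chars)                   # trained with plain per-character targets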
time_steps = img_w / (pool_size_1 * pool_size_2)

fdir = os.path.dirname(get_file('wordlists.tgz',
                                origin='http://www.isosemi.com/datasets/wordlists.tgz', untar=True))
It seems the HTTP server tries to redirect http to https, but in the process it removes a slash, so the URL becomes 'http://www.isosemi.comdatasets/wordlists.tgz', which is an invalid address.
# transforms RNN output to character activations:
inner = TimeDistributed(Dense(img_gen.get_output_size(), name='dense2'))(merge([gru_2, gru_2b], mode='concat'))
y_pred = Activation('softmax', name='softmax')(inner)
Should the activation be set according to the backend in use? Tensorflow's documentation reads [1]:
This class performs the softmax operation for you, so inputs should be e.g. linear projections of outputs by an LSTM
[1] https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss
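One way to see why the softmax output is not necessarily a problem: if the backend wrapper takes the log of y_pred before handing it to ctc_loss, the op receives log-probabilities, and because softmax is invariant to a per-timestep additive constant, those yield the same distribution as the raw logits would have. A small standalone NumPy check of that identity (not code from the PR):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.random.randn(5, 28)               # 5 time steps, 28 character classes
log_probs = np.log(softmax(logits))           # what a wrapper could pass as the op's "inputs"
print(np.allclose(softmax(logits), softmax(log_probs)))  # True: same distribution either way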
Note: we merged into Theano master ops for CTC from Baidu: http://deeplearning.net/software/theano_versions/dev/library/gpuarray/ctc.html
This commit adds support for training RNNs with Connectionist Temporal Classification (CTC), which is a popular loss function for streams where the temporal or translational alignment between the input data and labels is unknown. An example would be raw speech spectrograms as input data and phonemes as labels. Another example is an input image that includes rendered text with an unknown translational location, word/character spacing, or rotation.
For Tensorflow, a wrapper was created for the built-in CTC code and put in tensorflow_backend.py. This wrapper is fairly complex as it has to transform a dense tensor into a sparse tensor. Note that in the bleeding-edge Tensorflow, they moved the location of CTC from contrib to util.
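To illustrate the dense-to-sparse transformation mentioned above, one possible way to build the SparseTensor of labels that TensorFlow's CTC op expects is sketched below. This is an illustrative sketch (the function name and padding convention are assumptions), not the wrapper's actual code.

import tensorflow as tf

def dense_labels_to_sparse(labels, label_lengths):
    # labels: (batch, max_len) integer tensor, padded past each true length
    # label_lengths: (batch,) integer tensor with the true length of each label sequence
    mask = tf.sequence_mask(label_lengths, maxlen=tf.shape(labels)[1])
    indices = tf.where(mask)                        # positions inside each sequence's true length
    values = tf.gather_nd(labels, indices)          # the label ids at those positions
    dense_shape = tf.cast(tf.shape(labels), tf.int64)
    return tf.SparseTensor(indices, values, dense_shape)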
For Theano, an implementation was included courtesy of, and used with permission from, @shawntan. Because it was not written for batch processing, it's quite a bit slower than Tensorflow, but it does work.
This commit includes an example that performs OCR on an image. The example works with both Theano and Tensorflow. The text-based image is generated using a list of single words (wordlist_mono_clean.txt) and double words (wordlist_bi_clean.txt). I did my best to make sure no profanity ended up in these lists, but apologies in advance if I missed something. Here is an example output after 40 epochs:
The text is printed onto a 512 x 64 image using a variety of fonts (note: the font list works on CentOS 7, but I'm not sure what will happen on other OSes). This is done on the fly for all training images using generators. A random amount of speckle noise, font, rotation, and translation is applied. These images are then fed into a network consisting of two convolutional layers, a fully connected layer, two bidirectional recurrent layers, and finally a fully connected layer with 28 outputs (26 letters, space, and CTC blank). After about 10 epochs it does pretty well with 5-letter words, so harder words are introduced. After 20 epochs, phrases with spaces are introduced.
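As a rough sketch of how such a network can be trained end-to-end with CTC in Keras, the loss can be wired into the graph itself via a Lambda layer. The names and sizes below are illustrative stand-ins for the layers described above, not the example's exact code.

from keras import backend as K
from keras.layers import Input, GRU, TimeDistributed, Dense, Activation, Lambda
from keras.models import Model

time_steps, feat_dim, max_string_len, num_classes = 128, 32, 16, 28  # illustrative sizes

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# stand-in for the conv + bidirectional recurrent stack described above
inputs = Input(name='the_input', shape=(time_steps, feat_dim))
rnn = GRU(64, return_sequences=True)(inputs)
y_pred = Activation('softmax')(TimeDistributed(Dense(num_classes))(rnn))

labels = Input(name='the_labels', shape=[max_string_len])
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

# the loss is computed inside the graph, so the model's output *is* the loss
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [y_pred, labels, input_length, label_length])

model = Model([inputs, labels, input_length, label_length], loss_out)
# compile with an identity "loss" that just passes the in-graph CTC loss through
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adam')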
Additional notes:
This is my first Github commit ever so please go easy on me if I mucked something up :)