Moved GPageSeg Logic into a Class #290

ZoranPavlovic · 2018-01-06T22:19:42Z

Changes:

Moved gpageseg processing into class
Added logging module
Added constants for default args values
Removed a few unused variables, parameters and imports.

I did this so the gpageseg logic/algorithm can be imported and used from other python scripts. From here, I was thinking to separate the processing into separate sub-methods so the processing can happen in-memory using passed image arrays rather than going through input/output files.

Added logging module Added constants for default values

…sue.

mittagessen · 2018-01-07T10:53:47Z

Might I ask why you put it into a class instead of just a simple function? There is no data to be isolated, no state to be tracked etc, so the class is really just a python module I have to instantiate.

Just expose a simple segment() method, wrap all the subroutines in there and hand it the numpy array/file object/path/PIL image. Might wrap it into a small helper function to get non-pickleable data (file objects) into the processing pool if necessary. For extra niceness add an __all__ containing only segment so people know not to touch anything else.

ZoranPavlovic · 2018-01-07T12:42:15Z

@mittagessen

I was aiming to make a minimal amount of changes and testing the waters regarding the feasibility of making drastic changes. Wrapping all the logic into a separate-scope via a class seemed like a minimal-impact change to start with, along with a personal preference for classes and the extensibility that comes with them.

Ideally, yes, I would have preferred to have moved most of the code in to a separate file/module in Ocrolib and have it referenced from the gpageseg script, with configuration being passed in somehow. It is, as you said, pretty much a module on its own.

But even then, most of the sub-functions of the segmentation process do share global state/configuration via the args object currently in the root of the script. The core of the problem is the giant process1 function that steers us into treating this as a singular module. It needs to be broken up into logical pieces, with load, calculate and output being separate somehow. That way, whether it's called from the script/command-line, or imported and called from another script with the data passed in, the same algorithm is run and we can extend the possible usages.

Please have a look at my other branch here where I've take the refactoring a bit further? It's the direction I'm thinking I'd like to go, and it appears to be playing-out nicely.

mittagessen · 2018-01-07T17:30:40Z

It still is an muti-purpose class that you hand an opaque configuration object to with lots of public methods that nobody should ever call and a magic process method with a kinda-not-really path as an argument. I'd start with replacing the global state of those subfunctions and separating actual page segmentation from the file loading and maaybe putting that into into another module instead of treating classes as submodules. Especially when your changes aren't even importable from anywhere else because ocropus-gpageseg is not a valid module name.

ZoranPavlovic · 2018-01-08T21:23:16Z

@mittagessen - Simply saying "replace the global state of those subfunctions" doesn't really explain what it is that you propose be done. The argument object is either passed from the root starting point all the way down to the deepest function, or we unravel all the parameters individually and pass them around as necessary, or we use a class to denote shared "global" state that is contained and easily injectable. The last one being a good compromise with very little signature-changing.

OR do you propose we externally set/fiddle with the script-level args object in order to control the behavior that should occur once we invoke the processing function?

mittagessen · 2018-01-09T06:57:33Z

@ZoranPavlovic Unravel the state in the opaque args parameter and make explicit function/class parameters out of it. Right now you're handing a dict-like containing all kinds of known and unknown parameters to your class and it is completely unknown which of its methods are affected by which parameter. Make all the implicit method arguments and the class (if you're so keen on this class) arguments explicit and just instantiate the object with **args. By the way, you can change the function signatures in any way you want, chiefly because I'm almost 100% sure nobody ever imported this code into their own mainly because the file it is contained in isn't a valid module name. So I'd also move the actual segmentation functionality to a sane place.

It might be understandable to use a configuration object when there are hundreds of partially mutually exclusive options but not for 10 arguments, especially when the configuration object doesn't even configure the whole class and you still have to hand it's methods additional arguments.

zuphilip · 2018-01-21T10:08:07Z

Thanks for working on this. But currently we have some other PRs regarding the page segmentation file, which I want to consider before any reorganization.

ZoranPavlovic added 3 commits January 6, 2018 23:55

Moved gpageseg processing into class

4efbfae

Added logging module Added constants for default values

Removed unused variables, parameters and imports. Minor formatting is…

c2f994a

…sue.

Minor bug in parallel pool map argument passing

443b049

zuphilip added the 🗣️ discussion label Jan 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Moved GPageSeg Logic into a Class #290

Moved GPageSeg Logic into a Class #290

ZoranPavlovic commented Jan 6, 2018

mittagessen commented Jan 7, 2018

ZoranPavlovic commented Jan 7, 2018

mittagessen commented Jan 7, 2018

ZoranPavlovic commented Jan 8, 2018

mittagessen commented Jan 9, 2018

zuphilip commented Jan 21, 2018

Moved GPageSeg Logic into a Class #290

Are you sure you want to change the base?

Moved GPageSeg Logic into a Class #290

Conversation

ZoranPavlovic commented Jan 6, 2018

mittagessen commented Jan 7, 2018

ZoranPavlovic commented Jan 7, 2018

mittagessen commented Jan 7, 2018

ZoranPavlovic commented Jan 8, 2018

mittagessen commented Jan 9, 2018

zuphilip commented Jan 21, 2018