Still OCR: combine captured text + OCR'ing multiple images

etm001

状元
Hi,

I have "combine captured text" enabled for the still OCR block recognizer. Is it possible to combine captured text across multiple photos? That is:
  • Select to open image #1 from photo library.
  • [Block image OCR executes]
  • Select to open image #2 from photo library.
  • [Block image OCR executes --> appends text from image #2 to text from image #1]
It seems that as soon as you select a new image to OCR, the text from the prior image is lost. Is this correct?

Imagine that I want to OCR the text from 10 pages of a book. Would the best method be to save the pages as a PDF instead of as images, so that you can OCR all the text within one OCR "session"? (I actually tried this with my scanning app, but when I exported the PDF to Pleco, for some reason Pleco couldn't open the file.)

Thanks!

Update: I'm using TurboScan for the PDF scanning. When selecting "Send to", PDF files open in Reader. For whatever reason, Reader did not recognize any text in the PDF. I saved the file to Dropbox, then opened it via Still OCR, which worked perfectly. Would it be possible in a future release to have a setting that lets me choose whether incoming PDFs open in Reader or Still OCR by default? This would allow me to send directly from the scanning app to Pleco OCR, cutting out the intermediate step of saving to Dropbox or iCloud and then opening the PDF from within Still OCR.
 

mikelove

皇帝
Staff member
We're actually planning to merge those in a few releases, or at least to automatically detect whether a PDF has embedded text or not and direct it to the appropriate reader based on that. However, any incoming file will automatically be placed in your "Inbox" folder, so in this particular situation you can simply go to Still OCR / Image File and open the file from there.
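
As a rough illustration of that kind of detection (not Pleco's actual code), here is a minimal Python sketch. It assumes some PDF library has already extracted one text string per page; the function name `route_pdf`, the `"reader"` / `"still_ocr"` labels, and the 20-character threshold are all hypothetical.

```python
def route_pdf(page_texts, min_chars_per_page=20):
    """Decide where an incoming PDF should open.

    page_texts: list of strings, one per page, as extracted by whatever
    PDF library is in use (assumed; extraction itself is not shown).
    Returns "reader" if the PDF appears to carry embedded text,
    otherwise "still_ocr" (both labels are hypothetical).
    """
    if not page_texts:
        return "still_ocr"
    # Count non-whitespace characters so that blank or whitespace-only
    # text layers do not count as embedded text.
    total = sum(len("".join(text.split())) for text in page_texts)
    average = total / len(page_texts)
    return "reader" if average >= min_chars_per_page else "still_ocr"
```

A scanned PDF with no text layer would give empty strings per page and be routed to OCR; a threshold on the average, rather than a simple "any text at all" check, is one way to avoid being fooled by a stray page number or watermark.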
 

etm001

状元
However, any incoming file will automatically be placed in your "Inbox" folder
Quite helpful - I didn't know Pleco did this.

I know from other posts that changes/improvements are coming to OCR, but for what it's worth:
  • If you want to create a "rolling" selection of text, you have to press "capture", then select to recognize a new block of text, then press capture again, etc. It's a bit cumbersome. It'd be great to streamline the process and remove that middle "capture" step - or perhaps allow semi- or fully automatic recognition (i.e., point the OCR module at a document and let it process the pages as best it can).
  • I think I said this in an earlier post, but I just scanned vertical text that had a lot of mixed Latin and traditional Chinese, as well as traditional Chinese punctuation. OCR right now is missing a lot of the punctuation and does a 50/50 job with the Latin text, so it would be great if that could be improved in future releases.
Question: with the A8/A8X series of processors, have we reached a point where we can expect some significant, near-desktop-equivalent functionality in OCR (I read an old post from 2010 that said we were maybe "two or three generations away" from this)? Or do we need to wait another generation? I've thought a lot about mobile OCR lately, and whether it's fair to expect its functionality/ability to converge with desktop functionality. On the one hand, an iPad Air 2 is as powerful as MacBooks from not too long ago (in some ways); on the other hand, the physical form factor does not lend itself well to scanning large volumes of text (snapping dozens or hundreds of photos of pages is impractical). So having said that, I'm looking forward to upcoming improvements that will make Pleco more useful for small and "mid-sized" scanning jobs.

Thanks!
 

mikelove

皇帝
Staff member
Recognizing all of the text in a PDF would not be too difficult, but I'm not sure if the results would be that great for text that's not cleanly formatted.

In general we're kind of waiting for the software to catch up with the hardware on this - none of the embedded Chinese OCR libraries we know of currently scale up to properly utilize an A8-class processor; they'll run really, really fast, but they won't slow down and recognize more carefully. Most of them were originally designed to recognize still images (business cards and such) on ~2008-vintage phones.

We've been nagging our OCR vendor about porting their desktop OCR SDK to mobile, and hopefully eventually they will (or somebody else will), but in the meantime we have to come up with our own uses for the extra CPU capacity (e.g. better text detection algorithms for signs and such).

One other thing we've toyed with is offering some sort of online OCR option for documents, but we're not sure if this would be popular enough to be worth the investment of time + money.
 

Paul Duke

进士
I'm curious about the original poster's first question. What's the best way in Pleco at the moment to OCR a document of several pages? Take pictures of individual pages, OCR the first one, copy and paste the text into a new document, then repeat for each page, copying and pasting into that same document?

Would be cool if the "combine captured text" option worked until you told it to start a new document.

And by the way, I haven't used OCR in a few years, but had reason to in the past few days... I'm really impressed at how well it works now! I don't remember it being this fast and accurate a few years back...
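
To make the "combine until you start a new document" idea concrete, here is a minimal Python sketch of such a session. It only illustrates the requested behavior and says nothing about how Pleco works internally; the `OcrSession` class and its method names are made up, and the recognized text for each page is assumed to come from the OCR engine.

```python
class OcrSession:
    """Accumulates recognized text across images until explicitly reset."""

    def __init__(self, separator="\n"):
        self.separator = separator
        self.pages = []

    def add_page(self, recognized_text):
        # Append text from the latest image instead of replacing
        # what was captured from earlier images.
        self.pages.append(recognized_text.strip())

    def combined_text(self):
        # Join every page captured so far into one block of text.
        return self.separator.join(self.pages)

    def start_new_document(self):
        # Only an explicit request clears the captured text.
        self.pages = []
```

Opening a second image would then call `add_page` again rather than discarding the first page's text, and only `start_new_document` would clear it.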
 

mikelove

皇帝
Staff member
At the moment the best bet would be to do it via a PDF - OCR doesn't support multi-page documents in any other fashion yet.

And thanks! We have indeed improved it a bit, yes.
 