Author Topic: Full fledged color scanning with OCR (Read 20122 times)

zen.loop

  • Newbie
  • Posts: 14
Full fledged color scanning with OCR
« 05 February 2013, 19:40:50 »
Hi,

I have already scanned several books using 300 dpi tiff scans -> scantailor -> djvusolo or djvusmall.
However, now I want to scan books with many large color illustrations on every page, a non-white background, and colored text, and I want to OCR them.

Can someone tell me what's the best way to do this?
For an A4 page with a full page color photo what size can I expect?
With a TIFF scan at 300 dpi of about 20 MB (at 600 dpi, about 60 MB), Scan Tailor output at 300 dpi (600 dpi), and DjVu Solo saving with the Photo 300 dpi (600 dpi) profile, I get a file of about 500 KB (about 3 MB).
The 600 dpi image quality is great.
But in the 300 dpi case some areas are a bit blurry after Scan Tailor and DjVu Solo.
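As a rough sanity check on those sizes, here is a minimal Python sketch that estimates the uncompressed size of an A4 scan and the compression ratio implied by the DjVu sizes quoted above. It assumes 24-bit RGB with no TIFF compression and a full A4 page; the function name is made up for illustration. Under those assumptions the raw size comes out at roughly 26 MB at 300 dpi and roughly 104 MB at 600 dpi, so the 20 MB and 60 MB TIFFs mentioned above were probably compressed or slightly smaller than A4.

```python
# Estimate uncompressed scan size and the implied DjVu compression ratio.
# Assumptions (not from the thread): 24-bit RGB, no TIFF compression, full A4.

A4_INCHES = (8.27, 11.69)  # A4 is 210 x 297 mm


def raw_scan_bytes(dpi, size_in=A4_INCHES, bytes_per_pixel=3):
    """Return the uncompressed size in bytes of a scan at the given dpi."""
    width_px = round(size_in[0] * dpi)
    height_px = round(size_in[1] * dpi)
    return width_px * height_px * bytes_per_pixel


# DjVu sizes in KB as reported in the post: 500 KB at 300 dpi, 3 MB at 600 dpi.
for dpi, djvu_kb in ((300, 500), (600, 3000)):
    raw = raw_scan_bytes(dpi)
    ratio = raw / (djvu_kb * 1000)
    print(f"{dpi} dpi: raw ~{raw / 1e6:.0f} MB, DjVu ~{djvu_kb} KB, ratio ~{ratio:.0f}:1")
```

Doubling the resolution quadruples the pixel count, which is why the 600 dpi DjVu is several times larger than the 300 dpi one.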

Can we do better with djvu imager for example?
Is there another way?
I have been discussing this on LibraryGenesis forum here: http://genofond.org/viewtopic.php?f=1&t=6733&start=20
People have told me about didjvu and img2djvu, but I have not yet tried them.

So imagine that, as an example, I want to scan and OCR these sample pages:
http://whatimg.com/i/0wlfh.jpg
http://whatimg.com/i/zjd2b.jpg
http://whatimg.com/i/1x8ej.jpg

Thanks!

NBell

  • Full Member
  • Posts: 173
Re: Full fledged color scanning with OCR
« Reply #1: 07 February 2013, 16:43:36 »
DjVu Imager has adjustable image quality, so it may do better. OCR quality depends on the OCR software.
Try DjVu Small with the Arcand 300 profile. The result may be small; change the background subsample to increase quality.

zen.loop

Re: Full fledged color scanning with OCR
« Reply #2: 10 February 2013, 20:39:45 »
Thanks, I'll try it.

I have to say, though, that I don't really understand how to work with DjVu Imager. I tried it with Scan Tailor and the tutorial here: http://www.djvu-soft.narod.ru/scan/djvu_imager.htm
but something's not working; I don't understand the workflow needed.

And I don't know what to do with the subsample setting in DjVu Small. What does it do and mean? What values should I put in? It's set to "1" by default.

monday2000

  • Administrator
  • Posts: 985
  • Creating electronic books from paper books (in DjVu format)
Re: Full fledged color scanning with OCR
« Reply #3: 10 February 2013, 22:23:28 »
zen.loop
Quote
http://whatimg.com/i/0wlfh.jpg
http://whatimg.com/i/zjd2b.jpg
http://whatimg.com/i/1x8ej.jpg
These links are broken. I could not look at the samples.

This article is a bit outdated: http://www.djvu-soft.narod.ru/scan/djvu_imager.htm . I should probably update it, but that will take me about a week. You could wait until then if you wish.

NBell

Re: Full fledged color scanning with OCR
« Reply #4: 11 February 2013, 18:44:57 »
Use Scan Tailor.
Set the page's image mode to "Mixed".
Apply ST Split 1.4 to the output folder (use export from http://sourceforge.net/projects/scantailor/files/scantailor-devel/featured/).
Result folder 1 holds the b/w TIFFs: encode them as b/w DjVu.
Result folder 2 holds the color or gray TIFFs: encode them with DjVu Imager.
Use the b/w DjVu as the target DjVu.

zen.loop

Re: Full fledged color scanning with OCR
« Reply #5: 12 February 2013, 23:37:58 »
Thanks, monday2000 and NBell!

I did try to follow that tutorial for Scan Tailor, but I couldn't do it.
But anyway, what I'm really interested in is the case where the whole page is an image, so there is no need for separate text and image layers, and no need for a split.

Can you see this image?
http://postimage.org/image/c1zcelzqz/
This is what I want to know how to do, with good quality and not huge files.

I asked the person who scanned that file, on another site. He told me he scanned it to TIFF at 600 dpi, then used Photoshop to crop, edit and compress to JPEG (I don't know how exactly), and then used Acrobat to produce the PDF and do the OCR. In the end it's about 1 MB per page, which I think is still too big.

If I scan to TIFF at 300 dpi, use Scan Tailor with output at Photo600, and then DjVu Solo/Small at Photo600, I get about the same size and quality too.
But if I do it all at 300 dpi, the result gets blurry and watery.

And how do we do the OCR on top of that full-color DjVu? Which program do we use? Is the final size the same?

Thanks a lot for the feedback!
« Last edit: 12 February 2013, 23:41:15 by zen.loop »

zen.loop

Re: Full fledged color scanning with OCR
« Reply #6: 13 February 2013, 02:33:24 »
People have been scanning color magazines and comic books for years, right?
What do they do? What's the standard quality process? And how large are the final files?

monday2000

Re: Full fledged color scanning with OCR
« Reply #7: 13 February 2013, 20:35:35 »
zen.loop
Quote
Can you see this image?
Yes.
Quote
And how do we do the OCR on top of that full color djvu, which program do we use?
Use the latest versions of ABBYY FineReader. Load your DjVu into it, OCR it, and generate the OCRed DjVu. Then use the DjVuOCR 2 program to transfer the OCR layer from the ABBYY-made DjVu (which would be too large) into the original one.
Quote
Is the final size the same?
No, it is a bit bigger, but only slightly.
Quote
then djvusolo/small at Photo600
You might experiment with the automatic segmentation in DjVu Small, in order to auto-separate the white text into the text layer. If it succeeds, it will cut the size of the final DjVu.
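For readers without DjVuOCR, the same kind of text layer transfer can be sketched with `djvused` from DjVuLibre. This is a different tool from the one recommended above, swapped in here only for illustration. The Python sketch below just builds the two command lines and prints them; it does not run anything. The file names are hypothetical, and it assumes both DjVu files have the same page count and the same page dimensions, so that the text coordinates still line up after the transfer.

```python
import shlex


def text_transfer_commands(ocr_djvu, target_djvu, script="text.dsed"):
    """Build djvused command lines that copy a hidden text layer between files."""
    # Step 1: print the hidden text of the OCRed file as a djvused script
    # (the caller redirects stdout into the script file).
    extract_cmd = ["djvused", "-e", "output-txt", ocr_djvu]
    # Step 2: replay that script against the target file and save it in place.
    apply_cmd = ["djvused", "-f", script, "-s", target_djvu]
    return extract_cmd, apply_cmd


# Hypothetical file names, for illustration only.
extract_cmd, apply_cmd = text_transfer_commands("abbyy_ocr.djvu", "original.djvu")
print(shlex.join(extract_cmd), ">", "text.dsed")
print(shlex.join(apply_cmd))
```

Because only the text layer is copied, the image data of the target file stays as it was, which matches the small size increase described above.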

zen.loop

Re: Full fledged color scanning with OCR
« Reply #8: 14 February 2013, 02:45:29 »
Quote
ABBYY FineReader the latest versions. Load your DjVu in it, OCR it, and generate the OCRed DjVu. Then use DjVuOCR 2 program to transpose the OCR layer from the ABBYY-made DjVu (which would be too big-sized) into original one.

Ah, I see. Indeed, I had tried it with ABBYY alone and it was not good.

Quote
You might experiment with the automatic segmention in DjVu Small - in order to auto-separate the white text into the text layer. In case of success it will cut the size of the final DjVu.

I don't understand what you mean by this. Could you be more specific, please? Which settings should I change?

NBell

Re: Full fledged color scanning with OCR
« Reply #9: 14 February 2013, 13:32:45 »
Quote
try djvusmall with Arcand 300 profile
Step by step:
1. Start DjvuSmall 0.4.4.
2. Drag and drop your TIFF onto the program window.
3. In the lower left corner there is a dropdown labeled "Select encoding profile". Use it to select Arcand Scanned (300 dpi).
4. Press the "Convert" button.
5. If the background layer is too blurry, press the "Options" button.
6. Press "Encode to Djvu".
7. In the lower right corner, check the "Bg subsample" option and set it to 1 to obtain 1:1 background resolution. Larger values give smaller files at lower quality (roughly, size = quality / bg subsample).
8. Close the window by clicking the red button in the upper right corner.
9. Press "Convert".
10. If you are not satisfied with the size, try a larger value in step 7 (size = quality / bg subsample).
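To make the effect of "Bg subsample" in step 7 concrete, here is a minimal Python sketch of the arithmetic. It assumes the option is an integer factor by which the background layer is downsampled in each dimension, and that partial pixels are rounded up; both are assumptions about the encoder, and the function name is made up for illustration. The page size used is an A4 page at 300 dpi.

```python
def background_layer(width_px, height_px, dpi, subsample):
    """Return (width, height, effective dpi) of the background after subsampling."""
    if subsample < 1:
        raise ValueError("subsample factor must be 1 or greater")
    # Ceiling division, so a partial block still gets one background pixel.
    bg_width = -(-width_px // subsample)
    bg_height = -(-height_px // subsample)
    return bg_width, bg_height, dpi / subsample


page_w, page_h, page_dpi = 2481, 3507, 300  # A4 at 300 dpi
for factor in (1, 2, 3, 6):
    w, h, eff_dpi = background_layer(page_w, page_h, page_dpi, factor)
    share = (w * h) / (page_w * page_h)
    print(f"subsample {factor}: {w} x {h} px, ~{eff_dpi:.0f} dpi, "
          f"{share:.1%} of the original pixels")
```

The pixel count falls with the square of the factor, so a factor of 3 keeps only about one ninth of the background pixels. That is why larger values shrink the file and blur the image, and why 1 is the setting to use when the background looks too soft.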

For text layer transfer from FineReader 11.0.102.583, use FR11 DjVu Text Layer Crutch.

monday2000

Re: Full fledged color scanning with OCR
« Reply #10: 14 February 2013, 23:49:19 »
zen.loop
Quote
This I don't understand what you mean. Could you be more specific, please? Which settings should I change?
I mean the program profiles. Try to experiment with them.

zen.loop

Re: Full fledged color scanning with OCR
« Reply #11: 15 February 2013, 15:29:27 »
Quote
try djvusmall with Arcand 300 profile

Thanks, NBell. I had tried this, but I did not understand what the "Bg subsample" values mean, because when I change the value from 1 to something higher, the quality actually diminishes: the file gets smaller and blurrier. So it seems to me the best option is the default Arcand 300 dpi profile, without changing the settings. But this is still a bit blurry for me.

Can you download this sample scan?
https://www.wetransfer.com/downloads/4dc85a692d2c73d6e6be01508fa18d7c20130215105716/de627e#
It's a raw color tiff scan at 300 dpi.
13MB

If I use DjVu Small directly on the raw file (with no Scan Tailor step first), I get:
Arcand 300 dpi: a 180 KB file, which is not very bad quality, but still too blurry.
Photo600: a 200 KB file, not very good.
Photo300: exactly the same as Photo600.
Aggressive600: a 167 KB file, not very good.

If I use Scan Tailor and output at Photo600, I get a 50 MB TIFF file with great quality.
Then I tried several profiles on this 50 MB file in DjVu Small:
Aggressive600: a 222 KB file with no blurring, great.
Arcand300: a 298 KB file, a bit blurred.
Photo600: a 315 KB file with no blurring, great, looks like the Aggressive600 result.

I'll probably try this next with a raw TIFF scan at 600 dpi.
Anyway, the results above show that there is not a very great difference in size, and Arcand is always the blurriest.
Are there other settings you recommend trying?
Maybe I'll go for Aggressive600, after output with Scan Tailor at Photo600.
As I said, I did not understand the Bg subsample, because the result got worse and smaller when I increased it from 1 to other values.



Quote
use FR11 DjVu Text Layer Crutch for FineReader 11.0.102.583 text layer transfer

I tried this and found some brief Russian instructions on a forum (which I translated with Google), but it gave me an error: it did not open the DjVu file with OCR that FR11 saved. Unfortunately, the interface is all in Russian.
« Last edit: 15 February 2013, 15:33:41 by zen.loop »

zen.loop

Re: Full fledged color scanning with OCR
« Reply #12: 15 February 2013, 15:35:30 »
zen.loop
Quote
This I don't understand what you mean. Could you be more specific, please? Which settings should I change?
I mean the program profiles. Try to experiment with them.

I was asking about the segmentation and the text layers; that is what I don't understand. I have already tried many of the encoding profiles.

NBell

Re: Full fledged color scanning with OCR
« Reply #13: 15 February 2013, 16:05:10 »
DjVu Small has English help that describes the options.
If you do not want to explore them, experiment and choose a profile that satisfies you.

The OCR layer can be made with FineReader 8 + DjVuOCR 2.4 (FineReader 11 produces additional spaces, and they are doubled after the transfer; if you like that, use it).

For segmentation and the text layer structure, read the DjVu 3 spec.

In other words: RTFM.
« Last edit: 15 February 2013, 17:52:26 by NBell »

zen.loop

Re: Full fledged color scanning with OCR
« Reply #14: 15 February 2013, 19:12:39 »
Oh. I didn't notice there was a manual. Sorry.

So, increasing the subsample really does decrease quality. What we have to do is go to "Encode to DjVu(2)" and increase "Quality". The manual says to set it to 95.
And this does work well. I will try several of these combinations.

I might write an English tutorial on it in a couple of months, with several examples and combinations tabulated and compared. I am also thinking of improving the existing one and writing one for djvusplit. So I was thinking of a tutorial covering the 3 possibilities: b&w with few images (Scan Tailor + DjVu Small), b&w text with lots of images (djvusplit), and full-colour pages.

But the thing is, I'm still not convinced DjVu is the best choice for full color. Not that I know of an alternative, but it strikes me as odd that there is no known standard way to do this with good quality and small size. Comic book scans on the net are usually about 30 MB per 100 pages, with great quality. But they are never DjVu...
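To put that comparison in numbers, here is a small Python sketch that scales the per-page DjVu sizes reported in Reply #11 up to a 100-page book and sets them beside the comic book figure. It assumes 1 MB = 1000 KB and that every page compresses like the single sample page, which is a simplification; the resolution of the comic scans is not known, so this compares sizes only, not quality.

```python
# Per-page DjVu sizes in KB, as reported in Reply #11 for one sample page.
results_kb = {
    "raw TIFF -> Arcand 300": 180,
    "raw TIFF -> Photo600": 200,
    "raw TIFF -> Aggressive600": 167,
    "Scan Tailor Photo600 -> Aggressive600": 222,
    "Scan Tailor Photo600 -> Arcand300": 298,
    "Scan Tailor Photo600 -> Photo600": 315,
}

# Comic book scans: about 30 MB per 100 pages, i.e. about 300 KB per page.
COMIC_KB_PER_PAGE = 30 * 1000 / 100


def book_mb(kb_per_page, pages=100):
    """Return the size in MB of a book whose pages all have the given size."""
    return kb_per_page * pages / 1000


for label, kb in results_kb.items():
    relation = "smaller than" if kb < COMIC_KB_PER_PAGE else "at or above"
    print(f"{label}: {book_mb(kb):.1f} MB per 100 pages "
          f"({relation} the comic figure of {book_mb(COMIC_KB_PER_PAGE):.0f} MB)")
```

On these figures the DjVu results sit in the same range as the comic scans, roughly 17 to 32 MB per 100 pages against about 30 MB.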