Creating searchable image PDFs using Ghostscript




Message-ID: <g9v696$84m$1@aioe.org>
Subject: Creating searchable image PDFs using Ghostscript
Date: Sun, 7 Sep 2008 01:12:21 +0100


Hello,

What I have :

1. A bunch of pnm images generated by xsane from a manual.
2. Subsequently, these images are split up and cleaned using unpaper.
3. A working installation of ghostscript on Ubuntu.
4. A growing knowledge of the internal structure of PDFs as gleaned from the
Adobe PDF reference and the pdfmark reference.
5. A fine OCR tool (which is accurate enough for my purposes) - tesseract.
6. The following shell script that I wrote :

(just a segment)

<within a loop>

        convert -density 300 outputimage-$m.pgm outputimage-$m.tiff
        tesseract outputimage-$m.tiff outputimage-$m    
        rm outputimage-$m.tiff
        convert -density 300 outputimage-$n.pgm outputimage-$n.tiff
        tesseract outputimage-$n.tiff outputimage-$n    
        rm outputimage-$n.tiff
        potrace --level3 --backend postscript -r300 outputimage-$m.pgm -o page-$m.ps
        potrace --level3 --backend postscript -r300 outputimage-$n.pgm -o page-$n.ps
        echo "[ /Author (User)
        /Creator (User)
        /Producer (Ghostscript+potrace)
        /Keywords (`cat outputimage-$m.txt`)
        /DOCINFO pdfmark
        /F (outputimage-$m.txt) (r) file def
        [ /_objdef {mystream} /type /stream /OBJ pdfmark
        [ {mystream} F /PUT pdfmark
        [ /MyPrivateAnnotmyStreamData {mystream}
         /Subtype /Text
         /Rect [ 10 10 30 30 ]
         /Contents (`cat outputimage-$m.txt`)
         /SrcPg 1
         /Open false
         /Color [1 1 0]
         /Title (Tesseract - OCR)
         /ANN pdfmark
        [ /Name <feff 0041 0073>
        /FS<<
                /Type /Filespec
                /F (outputimage-$m.txt)
                /EF << /F {mystream} >>
                >>
        /EMBED pdfmark
        [ /PageMode /UseOutline
        /Page 1 /View [/Fit]
        /DOCVIEW pdfmark
        [ /Subtype /Text
        /Title (Tesseract - OCR)
        /Alt (`cat outputimage-$m.txt`)
        /StPNE pdfmark
        [ {Catalog} <</MarkInfo <</Marked true>>>> /PUT pdfmark" > pdfmarks-$m

        gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -r300x300 -sPAPERSIZE=letter \
           -dDOPDFMARKS -dPDFSETTINGS=/ebook -dPermissions=-4 \
           -sOutputFile=pageannotated-$m.pdf page-$m.ps pdfmarks-$m

<loop ends>

pdftk pageannotated-*.pdf cat output pdfjoined.pdf
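
One caveat with the segment above: the OCR text is interpolated straight into PostScript string literals with `cat`, so an unbalanced parenthesis or a backslash in the recognised text can break the pdfmark file. A minimal sketch of an escaping filter follows, assuming a POSIX shell with sed. The name ps_escape and the sample text are hypothetical and not part of the original script.

```shell
#!/bin/sh
# ps_escape (hypothetical helper): read text on stdin and escape the three
# characters that are special inside a PostScript ( ... ) string literal.
# Backslashes are doubled first so that the backslashes added in front of
# the parentheses are not doubled again.
ps_escape() {
    sed -e 's/\\/\\\\/g' -e 's/(/\\(/g' -e 's/)/\\)/g'
}

# Example: build a /Keywords entry from a sample string.
keywords=$(printf '%s' 'Valve (inlet) \ pump' | ps_escape)
printf '/Keywords (%s)\n' "$keywords"
# prints: /Keywords (Valve \(inlet\) \\ pump)
```

In the loop, this would presumably be used as `ps_escape < outputimage-$m.txt` wherever the script currently uses `cat outputimage-$m.txt`.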


What I do not have :

1. Properly created PDFs that are searchable.

Any ideas how one might embed OCR'ed text invisibly behind the page ?

I am aware of the amazing tool called gscan2pdf, which does what I am seeking
to do here using a Perl library called PDF::API2. That tool, however, makes
temporary copies of all the pnm and other files along the way and requires a
gigantic amount of temporary storage space (think 10G+) for the number of
pages I am trying to put together in a searchable PDF manual (think 1400
pages).




Message-ID: <ga0454$p1t$1$8300dec7@news.demon.co.uk>
Subject: Re: Creating searchable image PDFs using Ghostscript
Date: Sun, 7 Sep 2008 09:42:09 +0100


Geico Caveman wrote:
> 
> 1. Properly created PDFs that are searchable.
> 
> Any ideas how one might embed OCR'ed text invisibly behind the page ?

I believe Acrobat, when it embeds OCR'ed text invisibly, uses
'layers' -- they can be switched on and off.

http://www.acrobatusers.com/forums/ask_an_expert/questions/browse/layers/


I don't know about Ghostscript, which has advanced greatly since
I used it for any serious work, but here are a couple of
alternatives.

1. Solid PDF Creator Plus.
    It claims to do searchable OCR embedding.

    Not open source, but also not expensive. I have just
    got a free copy because I had previously bought a copy of
    an ebook from 'turbocash'


 
http://www.soliddocuments.com/features.htm?product=SolidPDFCreator#CreateScanToPDF


2. Using LaTeX and the 'attachfile' package. With this you can embed any
    file into your pdf, and your users can later extract it.

http://www.ctan.org/tex-archive/help/Catalogue/entries/attachfile.html


Cheers,
Ken








Message-ID: <ga2o37$7gl$1$8302bc10@news.demon.co.uk>
Subject: Re: Creating searchable image PDFs using Ghostscript
Date: Mon, 8 Sep 2008 09:34:43 +0100


Geico Caveman wrote:
> Thanks for your response.
> 
>>> Any ideas how one might embed OCR'ed text invisibly behind the page ?
>> I believe Acrobat, when it embeds OCR'ed text invisibly, uses
>> 'layers' -- they can be switched on and off.
>>
>> http://www.acrobatusers.com/forums/ask_an_expert/questions/browse/layers/
> 
> Since PDF is an open spec, this should be documented somewhere (the
> technique and the markup and not just the tool).
> 
> Acrobat is not an option as we do not use Windows or Mac in this production
> environment.
> 
> In any case, it's a GUI tool, and a very inefficient option for such large
> documents.
> 
>>
>> I don't know about Ghostscript, which has advanced greatly since
>> I used it for any serious work, but here are a couple of
>> alternatives.
>>
>> 1. Solid PDF Creator Plus.
>>     It claims to do searchable OCR embedding.
>>
>>     Not open source, but also not expensive. I have just
>>     got a free copy because I had previously bought a copy of
>>     an ebook from 'turbocash'
>>
>>
>>  
> 
> This again is a Windows tool.
> 
> Your response raises questions about whether PDF is an open spec or not. If
> PDF has a feature, it should be documented in the standard somewhere. I am
> not asking about implementations - I am asking about the precise markup
> needed.
> 
> If you refer to the original post, I have tried pdfmark with various
> promising sounding options and they do not do what I am trying to do.
> 
>> 2. Using LaTeX and the 'attachfile' package. With this you can embed any
>>     file into your pdf, and your users can later extract it.
>>
>> http://www.ctan.org/tex-archive/help/Catalogue/entries/attachfile.html
> 
> This is irrelevant. I am looking to create searchable image PDFs, not
> extractable ones.


Is PDF an open spec or not?
The best answer is found, as far as I know, at

http://www.adobe.com/devnet/pdf/pdf_reference.html

(Roughly, PDF 1.7 is entirely open and equivalent to ISO 32000.
Adobe's later versions are not, but will be documented
by Adobe. We can hope that any third parties who provide
extensions will also document them.)


Although neither 'layers' nor 'searchable images' are mentioned as
such, I suspect that all the information you need to 'roll your own'
is there somewhere, most likely in the sections about transparency, 
transparency groups, and so on.
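
As a further pointer for that search: one mechanism the PDF Reference does document for text that is present on the page but not painted is the text rendering mode operator `Tr`, where mode 3 means neither fill nor stroke. A minimal sketch in shell that prints such a content-stream fragment follows. The function name, the font resource name /F1, and the coordinates are assumptions for illustration, and this shows only the PDF markup, not a tested Ghostscript recipe.

```shell
#!/bin/sh
# invisible_text_ops (hypothetical name): print PDF content-stream operators
# that place one line of text at (x, y) in text rendering mode 3, which the
# PDF Reference defines as "neither fill nor stroke", i.e. invisible.
# Assumptions: /F1 names a font in the page's resource dictionary, and the
# text has already been escaped for a PDF ( ... ) string literal.
invisible_text_ops() {
    x=$1 y=$2 size=$3 text=$4
    printf 'BT\n'                       # begin text object
    printf '3 Tr\n'                     # text rendering mode 3: invisible
    printf '/F1 %s Tf\n' "$size"        # select font and size
    printf '%s %s Td\n' "$x" "$y"       # move to the start of the line
    printf '(%s) Tj\n' "$text"          # show the string
    printf 'ET\n'                       # end text object
}

# Example: one 10-point line near the top left of a letter-size page.
invisible_text_ops 72 720 10 'Chapter 1'
```

A viewer can still find and select text placed this way, while the scanned image remains the only thing that is drawn.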

Good luck,
Ken.



