
Fixing PDF file with strange encoding




Message-ID: <2956778.SAYty65iCy@sable.nothingisreal.com>
Subject: Fixing PDF file with strange encoding
Date: Fri, 14 Nov 2008 18:41:59 +0100


Greetings.

I have a large number of English-language PDF files that I need to index. 
Unfortunately, these files seem to have some strange font encoding; when
one copies and pastes text from them with a PDF viewer, or converts them
to text using xpdf's pdftotext, one gets gibberish.  But it's consistent
gibberish; that is, there's a one-to-one mapping of English characters to
gibberish ones.  For example, space is always '=', 'a' is always '~', 'd'
is always 'Ç', and so on.

It seems that this sort of weird character encoding is not unusual.  A
Google search for 'íÜÉ', which is the encoded version of 'the', brings up
almost a million hits, seemingly all of them PDF documents:
<http://www.google.com/search?q=%C3%AD%C3%9C%C3%89>
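
Since the substitution is consistent, the examples quoted so far already determine part of a decoding table. Below is a minimal Python sketch, assuming the garbled text has already been extracted as Unicode (for example from pdftotext output). The names KNOWN_PAIRS and decode are illustrative only, and the table holds just the six pairs mentioned above, so every other character passes through unchanged:

```python
# Garbled character -> intended character.
# Only the six pairs quoted in this message are known; the full table
# would have to be filled in from the actual documents.
KNOWN_PAIRS = {
    "=": " ",
    "~": "a",
    "Ç": "d",
    "í": "t",
    "Ü": "h",
    "É": "e",
}


def decode(garbled: str) -> str:
    """Replace every garbled character that has a known mapping.

    Characters without an entry in KNOWN_PAIRS are left as they are.
    """
    return garbled.translate(str.maketrans(KNOWN_PAIRS))


if __name__ == "__main__":
    print(decode("íÜÉ"))  # prints "the"
```

This is the same idea as piping pdftotext output through tr, just with the table kept in one place where it can be extended.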

I'd like to know, first of all, what this encoding is and where it comes
from, and secondly, how I can automatically convert these files to a
standard encoding, such as UTF-8.  Ideally I'd like to keep them as PDF
files, but even if I could get raw, unformatted text that would be a step. 
(I realize I can do the latter by using copy and paste to manually figure
out the mapping between characters and then use pdftotext and tr, but that
would be quite laborious; I'm hoping there's a ready-made tool.)
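
The manual step could be partly automated: given one passage whose correct text is known, together with the gibberish extracted for it, the table can be read off position by position. Below is a rough Python sketch, assuming the two samples line up character for character. The helper build_mapping is hypothetical and not part of xpdf or any existing tool:

```python
def build_mapping(garbled: str, plain: str) -> dict[str, str]:
    """Derive a garbled -> plain character table from an aligned sample.

    Raises ValueError if the samples differ in length or if the pairing
    is not one-to-one, which would mean the alignment is off.
    """
    if len(garbled) != len(plain):
        raise ValueError("samples must align character by character")

    forward: dict[str, str] = {}   # garbled character -> plain character
    backward: dict[str, str] = {}  # plain character -> garbled character
    for g, p in zip(garbled, plain):
        # setdefault stores the pair on first sight and returns the
        # earlier value afterwards, so a mismatch signals a conflict.
        if forward.setdefault(g, p) != p:
            raise ValueError(f"{g!r} maps to both {forward[g]!r} and {p!r}")
        if backward.setdefault(p, g) != g:
            raise ValueError(f"{p!r} comes from both {backward[p]!r} and {g!r}")
    return forward


if __name__ == "__main__":
    # Sample built only from the pairs quoted in this message.
    table = build_mapping("íÜÉ=Ç~Ç", "the dad")
    print("íÜÉ".translate(str.maketrans(table)))
```

A table built this way covers only the characters that occur in the sample, so a longer known passage gives a more complete mapping.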

Regards,
Tristan

-- 
   _
  _V.-o  Tristan Miller [en,(fr,de,ia)]  ><  Space is limited
 / |`-'  -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=  <>  In a haiku, so it's hard
(7_\\    http://www.nothingisreal.com/   ><  To finish what you




 
