Fixing PDF file with strange encoding
Message-ID:<2956778.SAYty65iCy@sable.nothingisreal.com>
Subject:Fixing PDF file with strange encoding
Date:Fri, 14 Nov 2008 18:41:59 +0100
Greetings.
I have a large number of English-language PDF files that I need to index.
Unfortunately, these files seem to have some strange font encoding; when
one copies and pastes text from them with a PDF viewer, or converts them
to text using xpdf's pdftotext, one gets gibberish. But it's consistent
gibberish; that is, there's a one-to-one mapping of English characters to
gibberish ones. For example, space is always '=', 'a' is always '~', 'd'
is always 'Ç', and so on.
It seems that this sort of weird character encoding is not unusual. A
Google search for 'íÜÉ', which is the encoded version of 'the', brings up
almost a million hits, seemingly all of them PDF documents:
<http://www.google.com/search?q=%C3%AD%C3%9C%C3%89>
I'd like to know, first of all, what this encoding is and where it comes
from, and secondly, how I can automatically convert these files to a
standard encoding, such as UTF-8. Ideally I'd like to keep them as PDF
files, but even if I could get raw, unformatted text that would be a step forward.
(I realize I can do the latter by using copy and paste to manually figure
out the mapping between characters and then use pdftotext and tr, but that
would be quite laborious; I'm hoping there's a ready-made tool.)
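
To make the manual fallback concrete, here is a minimal sketch in Python of what the tr step would amount to. It uses only the six pairs that can be read off the examples above (space, 'a', 'd', and the three letters of 'the'); the names KNOWN_PAIRS and decode are hypothetical, and the rest of the table would still have to be worked out by inspecting the files. Characters not in the table pass through unchanged.

```python
import unicodedata

# Pairs read off the examples in this message: garbled -> intended.
# This is NOT a complete table; the remaining entries are unknown and
# would have to be filled in by comparing the PDFs with their extracted text.
KNOWN_PAIRS = {
    "=": " ",
    "~": "a",
    "Ç": "d",
    "í": "t",
    "Ü": "h",
    "É": "e",
}

TABLE = str.maketrans(KNOWN_PAIRS)


def decode(garbled: str) -> str:
    """Map garbled characters back to the intended ones.

    The input is normalised to NFC first so that accented characters
    arrive as single code points and match the table keys.
    """
    return unicodedata.normalize("NFC", garbled).translate(TABLE)


if __name__ == "__main__":
    print(decode("íÜÉ=ÇÉ~Ç"))
```

With a complete table, the output of pdftotext could be piped through a function like decode before indexing; that yields plain text only and does not repair the PDF files themselves.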
Regards,
Tristan
--
_
_V.-o Tristan Miller [en,(fr,de,ia)] >< Space is limited
/ |`-' -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= <> In a haiku, so it's hard
(7_\\ http://www.nothingisreal.com/ >< To finish what you