How to count, edit and translate PDF files


PDF files are tough to edit and translate. Before a PDF can be translated, it needs to be converted to an editable format. How difficult this conversion is varies with the type of PDF, and in some cases it may even be impossible.

What’s more, to make sure the right conversion procedure and the right tool are being used, you need to know how to distinguish between the different types of PDFs out there.

What is PDF format?

PDF stands for Portable Document Format, a file format developed by Adobe in 1993 that allows documents to be created and displayed independently of the software and hardware used.

Ultimately, a PDF will be displayed and rendered in the same way, regardless of the computer used. This feature has made the format one of the most popular methods of document sharing. For many people, creating a PDF of a document has become the equivalent of “making a digital photocopy” of it, and this is a great advantage in terms of practicality. However, the resulting difficulty that comes up when trying to edit or translate PDFs is often neglected.

A PDF can contain various elements. Some elements do not belong to the visible text: these are called “properties”, and consist of:

  • the author’s name
  • the title
  • the date of creation
  • the tool used to create it, etc.

The other elements make up the document itself and are generally:

  • text
  • bitmap images (photographs)
  • vector graphics (lines, some types of diagrams).

How many types of PDFs are there?

When you receive a PDF file, you will first need to check its content to see whether it is:

  1. A “real” PDF, i.e. a document created digitally with a programme like Word or Excel, or with another programme using the “Print” function (virtual printing); this type of PDF may contain text, vector graphics and bitmap images.
  2. A PDF containing a scan of a paper document, made by simply taking a photo of the original document or scanning it in; these are JPG or TIFF images saved as PDF documents which serve to contain them, and their text cannot be selected.
  3. A “hybrid” of the preceding types, basically a PDF which has a visible top layer made of scanned or photographed images, but whose text can be selected and searched for, as it has been converted by an optical character recognition system. Some programs, like those combined with scanners, in addition to copying the paper document exactly, recognise the text during scanning and save it to a layer below the image.

How do I know what type of PDF I am dealing with?

If you want to edit or translate a PDF, you will need to check whether the document’s text is actually stored as text, that is, whether or not it can be selected. Simply open the document with Adobe Reader (or with any other PDF viewer) and click on the text selection icon in the toolbar, or zoom into the document.

If the text appears out of focus or stretched beyond a certain zoom level, you are looking at a scan. On the other hand, if you zoom in and the text does not lose resolution, the PDF has been generated by a program.

As mentioned in the previous section, there are also “hybrid” cases, where the document is a scan, but the text can still be selected. In these cases, if you want to extract the text, simply select it. However, check spelling and text accuracy carefully, as optical character recognition (OCR) systems that extract text from images have a certain margin of error.
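The two manual checks described above (can the text be selected, and does it blur when you zoom in) are enough to tell the three types apart. A minimal Python sketch of that reasoning follows; the function name and its two parameters are hypothetical, and the values are what you observe yourself in the PDF viewer, not something the code detects.

```python
def classify_pdf(text_selectable: bool, blurry_when_zoomed: bool) -> str:
    """Map the two manual observations to one of the three PDF types."""
    if blurry_when_zoomed:
        # The page is an image: either a plain scan or a scan with an OCR text layer.
        return "hybrid" if text_selectable else "scan"
    # Text stays sharp at any zoom level: the PDF was generated by a program.
    return "real"


print(classify_pdf(text_selectable=True, blurry_when_zoomed=True))    # hybrid
print(classify_pdf(text_selectable=False, blurry_when_zoomed=True))   # scan
print(classify_pdf(text_selectable=True, blurry_when_zoomed=False))   # real
```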

Once you have established that the PDF is a "real" PDF generated by an application, you can find out which application generated it by consulting the document properties (usually with CTRL+D, or File | Document Properties) and reading what is in the Description tab.

Under Application (or similar), you should see the name of the program used to create the PDF.
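If you have many files to check, the same properties can sometimes be read with a script. The sketch below is a rough heuristic, not a PDF parser: it assumes the properties are stored uncompressed as simple literal strings such as /Creator (Microsoft Word), which is not true of every PDF (they may be stored in hexadecimal, in UTF-16, or inside compressed objects). The function name read_info_field is hypothetical.

```python
import re
from typing import Optional


def read_info_field(pdf_bytes: bytes, key: str) -> Optional[str]:
    """Return a simple literal-string property such as Creator, or None."""
    # Matches e.g. "/Creator (Microsoft Word)". Escaped characters are kept
    # as they appear in the file; nested parentheses are not supported.
    pattern = rb"/" + re.escape(key.encode("ascii")) + rb"\s*\(((?:\\.|[^\\()])*)\)"
    match = re.search(pattern, pdf_bytes, re.DOTALL)
    if match is None:
        return None
    return match.group(1).decode("latin-1")


# Made-up fragment standing in for the bytes of a real file.
sample = b"<< /Title (Brochure) /Creator (Microsoft Word) /Producer (Acrobat Distiller) >>"
print(read_info_field(sample, "Creator"))    # Microsoft Word
print(read_info_field(sample, "Producer"))   # Acrobat Distiller
print(read_info_field(sample, "Author"))     # None
```

For a real file you would pass the bytes read from disk. When the function returns None, fall back on the Document Properties dialog.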

Ideally at this point you should ask your client for the editable file, and specify that you are certain it exists (having just read about it under Document Properties). Having the source file that generated the PDF is the only way you can work comfortably on the document, with the certainty of being able to generate another PDF identical to the original once the translation or editing is finished.

Usually, one way to convince the client is to tell them that they will otherwise incur a surcharge to cover the costs of the conversion process. Obviously, this type of negotiation depends on the relationship established with the client and on the bargaining power you have in each particular case.

Frankly, it may even be the case, especially when it comes to multinationals, that the person who sent you the PDF does not actually have access to the editable file. DTP (layout) services are often done at a company’s head office and the final PDF files are sent to branches to be printed on site. A translation may only have been necessary later on, and in such a case, tracing a file back to its original source can be quite a task.

If, despite all your efforts, you are unable to access the original file, there are still some options at your disposal to help export the text.

TAKE NOTE: It should be stressed right now that no option will result in a file that is perfectly identical to the original, especially if it contains images (bitmaps) and some degree of formatting, or special fonts.

The chosen method, and therefore the degree of precision, also depends on why the text is being extracted. There are two possibilities here:

  1. to have the text available just for word-counting or for copying (and pasting) purposes
  2. to create an editable file as close to the original as possible, which can then be translated or edited.

How to count the words of a PDF

If you only need to count the words in a PDF to be able to estimate the cost of a translation, you won’t even need to extract the text. If the PDF text is encoded as text (as we have seen above), you will be able to use one of the following tools:

If you cannot or do not want to use the suggested software and you have Adobe Acrobat (not Adobe Reader), you can extract the text by:

  • opening the PDF file with Adobe Acrobat
  • saving the document as RTF or DOC from the File menu.

In this case you may need to apply one or more macros to fix the format, depending on the original document type. For example, this Word macro restores correct carriage returns (the link points to a copy archived on www.archive.org, because www.terminologymatters.com is no longer online). Another very efficient macro, this time for OpenOffice and LibreOffice, is PerfectEpub, an improved version of MyTXTcleaner.
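To give an idea of what this kind of clean-up involves, here is a minimal Python sketch that rejoins lines broken in the middle of a paragraph. It is not the macro mentioned above, only an illustration of the same idea, and it assumes that paragraphs in the extracted text are separated by at least one blank line. The function name restore_paragraphs is hypothetical.

```python
import re


def restore_paragraphs(text: str) -> str:
    """Join lines broken mid-paragraph; blank lines mark paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = [
        " ".join(line.strip() for line in block.split("\n") if line.strip())
        for block in blocks
    ]
    return "\n\n".join(paragraphs)


extracted = "Files in PDF are\r\ntough to edit.\r\n\r\nSecond\nparagraph."
print(restore_paragraphs(extracted))
```

Words hyphenated at the end of a line are deliberately left alone, because a script cannot tell a line-break hyphen from a real one without a dictionary.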

If you do not have Adobe Acrobat:

  • open the file with Adobe Reader
  • choose the text selection tool
  • select the entire text (CTRL+A)
  • copy it (CTRL+C)
  • open Word or any other word processor
  • paste the text (CTRL+V).

Of course, this option is also valid if the text to be analysed or translated is only a part of the document.
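Once the text has been pasted into a plain text file, the count itself is easy to automate. The following Python sketch counts words in a way that is close to, but not necessarily identical to, what Word or a CAT tool reports, since every tool has its own rules for numbers, hyphens and apostrophes. The function names and the per-word rate are hypothetical.

```python
import re


def count_words(text: str) -> int:
    """Count words, keeping hyphenated and apostrophised forms as one word."""
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))


def estimate_cost(words: int, rate_per_word: float) -> float:
    """Rough quote: word count times a per-word rate (hypothetical pricing model)."""
    return round(words * rate_per_word, 2)


text = "Don't over-think it: 3 words?"
words = count_words(text)
print(words)                        # 5
print(estimate_cost(words, 0.10))   # 0.5
```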

Extracting the text of a PDF is also useful in cases where a quick translation of the text is necessary and you cannot or do not want to use the services of a human translator. The text extracted using the methods described in this article can be pasted into an automatic translator. Of course, if you want a high quality translation, our advice is to always rely on specialised professional translators.

How to edit a PDF without changing the format

To maintain PDF formatting while editing or translating, there are two options:

  • use one of the myriad of programs that convert PDFs to Word format
  • use an OCR like FineReader, OmniPage, ReadIris, etc.

Using programs that promise direct conversion without user intervention is not recommended. These programs usually create Word documents that visually maintain the appearance of the original PDF, but achieve this by using very complicated formatting full of text frames, section breaks, columns, styles, and line spacing.

As soon as the document is changed, for example by deleting a sentence or opening it with a computer-assisted translation program, the format breaks apart and more often than not becomes humanly impossible to work on.

Therefore, we recommend performing the conversion with an OCR program. We found that Abbyy FineReader gave us the best results. The best strategy is to modify the default settings manually, i.e. to indicate the distribution of various elements on the page to the program.

If the format needs to be maintained and the file also has to be rebuilt from scratch (which is always necessary when the file that produced the PDF no longer exists), we have two options:

  1. either work with a DTP programme (InDesign, Scribus, Inkscape, QuarkXPress, etc.) and use the original PDF as a model, or
  2. use Infix, a PDF editor distributed by Iceni.

Iceni PDF Editor (available on subscription or as a single purchase) has a useful feature (TransPDF) which exports PDFs to XLIFF, the translation industry standard format. This XLIFF file can be translated with any CAT tool. The translated file must then be imported back into the original PDF, again using Infix Professional. The Infix website has a clear video explaining the entire process.
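Because XLIFF is plain XML, the exported file can also be inspected with a few lines of code, for example to list the segments before loading it into a CAT tool. The Python sketch below assumes an XLIFF 1.2 file; we have not checked which version or inline tags the export actually uses, and the sample content is made up for illustration. The function name source_segments is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 (assumption: the export uses this version).
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="brochure.pdf" source-language="en" target-language="it" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>What is PDF format?</source><target/></trans-unit>
      <trans-unit id="2"><source>Text, <g id="1">bitmap images</g> and vector graphics.</source><target/></trans-unit>
    </body>
  </file>
</xliff>"""


def source_segments(xliff_text: str) -> list:
    """Return the plain text of every <source> element, ignoring inline tags."""
    root = ET.fromstring(xliff_text)
    segments = []
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        if source is not None:
            segments.append("".join(source.itertext()).strip())
    return segments


for segment in source_segments(SAMPLE):
    print(segment)
```

This only reads the file. Translating and re-importing it should still be done with a CAT tool and the PDF editor, so that inline tags are preserved.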

If you need to work with a layout programme instead, you need to use the original PDF as a background template. We recommend reading the following article for more details: “Translation and DTP of a PDF file”.

If you do not need to use an OCR programme regularly and are hesitant to invest for just a one-time use, you can use one of the many online converters, such as Zamzar.com, though the results may suffer some of the same issues as desktop converters.

If the PDF was generated from Microsoft Word, another option that usually gives excellent results is to convert the PDF back with Microsoft Word itself. In this case, Word “recognises” that the PDF was generated with the same programme and the conversion is highly accurate.

Another programme with powerful PDF editing capabilities is Inkscape, the free and open-source vector graphics editor (an alternative to Adobe Illustrator). Inkscape allows you to open and edit PDFs. However, the number of available options and settings can be overwhelming, so some familiarity with this programme is required.

How to convert a PDF from a scan?

The above applies to PDFs generated by apps. If the text contained in the PDF is made up of images (as in the typical case of a fax received and then scanned in) the only way to export it to an editable format is by using an OCR program.

How do I convert a password-protected PDF?

Another complication that may arise is the level of security protecting the PDF that needs translating. PDFs can be protected with two levels of security: a user password and an owner password. The first prevents the document itself from being opened, while the second restricts one or more features, such as printing, copying text, making changes, adding notes, etc.

So, if the author of the PDF has chosen to restrict editing via passwords, it will be impossible to apply the methods described above. In this case, you’ll have to contact the client and ask to be sent the password. In cases where this is not possible, it helps to know that there is a range of tools out there that can decipher owner passwords quickly. Just Google "PDF crack" (you can even find online tools, such as Unlock-PDF). However, cracking a user password to open a PDF is much more complicated. In this case, programs resort to “brute force” methods, which can take hours, even days, to decipher passwords.

WARNING: Please note that the use of these tools may infringe property rights and Qabiria does not encourage their use under any circumstances.
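Without cracking anything, it is possible to see which features an owner password restricts. In the PDF standard security handler, these restrictions are stored as bit flags in an integer called P: bit 3 covers printing, bit 4 making changes, bit 5 copying text and bit 6 adding notes. The Python sketch below only decodes such a value. How you obtain it (from the viewer's security properties or from a PDF library) is left open, and the function and key names are hypothetical.

```python
# Bit positions (counted from 1) of the permission flags in the /P value.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "copy_text": 5,
    "annotate": 6,
}


def decode_permissions(p: int) -> dict:
    """Translate a /P value into the features it allows (True) or blocks (False)."""
    return {name: bool(p & (1 << (bit - 1))) for name, bit in PERMISSION_BITS.items()}


print(decode_permissions(-44))
# {'print': True, 'modify': False, 'copy_text': True, 'annotate': False}
print(decode_permissions(-4))
# {'print': True, 'modify': True, 'copy_text': True, 'annotate': True}
```

If copy_text comes out as False, selecting and copying the text will be blocked in the viewer, and the client's password is needed before any of the extraction methods above can work.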

A reminder in chart form

To further clarify the logical steps needed to convert a PDF for translation or editing, we have created a flowchart to serve as a practical guide, which you can download for free and without signing up.


Download the diagram "How to translate PDF"
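For readers who prefer code to diagrams, the main decisions described in this article can also be written as a small function. This Python sketch is our own rough summary of the text, not a transcription of the flowchart, and it leaves out the option of rebuilding the layout from scratch. The function name, the parameter names and the returned wording are all hypothetical.

```python
def recommend_approach(
    password_restricted: bool,
    source_available: bool,
    text_selectable: bool,
    purpose: str,
    created_with_word: bool = False,
) -> str:
    """purpose is "count" (word count, copy and paste) or "translate"."""
    if purpose not in ("count", "translate"):
        raise ValueError(f"unknown purpose: {purpose!r}")
    if password_restricted:
        return "ask the client for the password"
    if source_available:
        return "work on the editable source file"
    if not text_selectable:
        return "run the scan through an OCR program"
    if purpose == "count":
        return "extract the text (save as RTF/DOC, or copy and paste)"
    if created_with_word:
        return "convert the PDF back with Microsoft Word"
    return "convert with an OCR program, marking page areas manually"


print(recommend_approach(False, False, True, "count"))
# extract the text (save as RTF/DOC, or copy and paste)
print(recommend_approach(False, False, False, "translate"))
# run the scan through an OCR program
```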

If you know of any other ways to translate or edit PDFs, or want to contribute your own experience to the discussion, don’t hesitate, simply write a comment below.

Need to translate a PDF and don’t know how? Don’t hesitate: Contact us.

NOTE: article originally written on 10/25/2008 and updated on 1/19/2022. Some comments may refer to revised or corrected sections.

Technical translator, project manager, entrepreneur. Languages graduate with an MA in Design and Multimedia Production. He founded Qabiria in 2008.
