Poppler: Displaying PDF Files with Qt


Qt Quarterly, Issue 27

by David Boddie

As we saw earlier, Qt can be used to generate documents in an ever-growing range of formats that can be viewed and edited in external applications. Qt comes with facilities for displaying HTML out of the box, and it can create its own print previews, but what about other file formats that originate outside Qt applications?

Fortunately, there are third party libraries available for some of the things that Qt doesn't provide. One of these is Poppler, a Portable Document Format (PDF) rendering library that forms the basis of a number of widely-used PDF viewing applications. Poppler is a fork of the Xpdf PDF viewer that is licensed under the GNU General Public License. Xpdf can also be obtained under other licensing terms.

Poppler is designed in a way that allows it to be used with any toolkit or framework as long as a suitable rendering backend is available. Qt application developers are fortunate in that there is also a Qt frontend available—a set of Qt-style classes that use Qt classes to describe parts of PDF documents.

In this article, we'll take a brief look at some of the features provided by Poppler in the context of creating a simple PDF viewing application.

Setting Things Up

Developers using Linux should find that Poppler and the Qt 4 frontend are available as a package for most recent distributions. Developers on Windows, Mac OS X, and other Unix platforms can download source code from the poppler.freedesktop.org Web site.

By default, Poppler is built with all kinds of frontends and backends. If you compile Poppler from source, you can exclude some of these to save compile time. When configuring the build, it may be easier to set the installation prefix to that used for the Qt installation—this prefix is the directory under which subdirectories containing executables, libraries and data files are stored.

It is important to know where the Poppler library and header files will be installed because our example will need them.

Rendering Documents

In our example, we provide a simple user interface to display PDF files, displaying a single page at a time and providing controls to let the user move between pages. Each page is displayed in a custom widget, DocumentWidget, held in the main window's central widget, a scroll area.

The user opens a new file via a file dialog, which we open in response to an action being triggered. The path to the file is passed to the DocumentWidget so that the document it contains can be fed to the Poppler library.

Unlike with many Qt classes, we load a document using a static function in the following way:

    Poppler::Document *doc = Poppler::Document::load(path);

If the document returned is not null, we have a document that we can explore. Note that our example takes ownership of the document, so we must remember to dispose of it when we have finished with it.
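One way to avoid forgetting the disposal step is to hand the raw pointer to a smart pointer as soon as it is returned. The sketch below illustrates the pattern only: Document here is a stand-in class written for this illustration, not the Poppler API, and openDocument() and liveCount are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for a library class whose static load() returns an owning raw
// pointer, or a null pointer on failure. This is NOT Poppler::Document.
class Document {
public:
    static int liveCount;   // number of Document objects currently alive

    static Document *load(const std::string &path)
    {
        if (path.empty())
            return nullptr;             // simulate a failed load
        return new Document(path);
    }

    ~Document() { --liveCount; }

private:
    explicit Document(const std::string &path) : path_(path)
    {
        assert(!path.empty());
        ++liveCount;
    }

    std::string path_;
};

int Document::liveCount = 0;

// Take ownership immediately; the document is deleted automatically when
// the returned pointer goes out of scope.
std::unique_ptr<Document> openDocument(const std::string &path)
{
    return std::unique_ptr<Document>(Document::load(path));
}
```

A null result still has to be checked before use, exactly as with the raw pointer.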

Each document contains a series of pages that can be obtained one by one using the Document::page() function. Although the Document class has a collection of functions to control the appearance of the document, actual rendering is performed by each Page object. In our example, we render pages into QImage objects that we display using the DocumentWidget, itself just a simple QLabel subclass.

The key part of our DocumentWidget::showPage() function looks like this:

    void DocumentWidget::showPage(int page)
    {
        QImage image = doc->page(currentPage)->renderToImage(
                            scaleFactor * physicalDpiX(),
                            scaleFactor * physicalDpiY());
        ...
        setPixmap(QPixmap::fromImage(image));
    }

In the above code we pass the resolution of the image to be created, multiplied by a scale factor that the user controls via the example's user interface. We have to be careful with the range of scale factors available because it is easy to request extremely large images. In practice, we restrict the user's choice to a set of predefined scale factors.
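To see why the range of scale factors matters, it helps to work out how many pixels a page needs. The following is a minimal sketch in plain C++ with no Qt dependency; the list of zoom levels and the names kScaleFactors, nearestScaleFactor() and pixelsForPoints() are assumptions made for illustration, since the values used by the example are not listed here.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical set of zoom levels offered to the user.
const std::array<double, 6> kScaleFactors = {0.25, 0.5, 1.0, 2.0, 3.0, 4.0};

// Snap an arbitrary requested scale to the nearest predefined factor.
double nearestScaleFactor(double requested)
{
    double best = kScaleFactors[0];
    for (double candidate : kScaleFactors) {
        if (std::fabs(candidate - requested) < std::fabs(best - requested))
            best = candidate;
    }
    return best;
}

// Number of pixels along one axis for a page dimension given in points
// (1 inch = 72 points), rendered at the given resolution and scale.
long pixelsForPoints(double points, double dpi, double scaleFactor)
{
    assert(points >= 0 && dpi > 0 && scaleFactor > 0);
    return std::lround(points / 72.0 * dpi * scaleFactor);
}
```

A US Letter page measures 612 by 792 points, so at 96 dpi and a scale factor of 2 it already needs 1632 by 2112 pixels; the pixel count grows with the square of the scale factor.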

Searching for Text

One of the many useful features that Poppler provides is the ability to locate specific text strings in PDF documents. Since PDF is designed to store printable rather than editable documents, it is not always easy to access and reconstruct the author's original text. However, Poppler does a good job of locating text in many documents, and we can expose this feature in our example.

The API for locating text provides conventional features such as case-insensitive and directional searching, but also returns information about the position of any located text on the page—since PDF is a display format, this is really the only useful information about the text we can obtain. This information can be used to indicate where any subsequent searches should begin.

Basically, the code to perform a forward search in a given page looks like this:

    bool found = page->search(text, searchLocation,
                 Poppler::Page::NextResult,
                 Poppler::Page::CaseInsensitive);

Here, searchLocation is a QRectF object that indicates where the search should start from on the given page. Initially, when we perform a search, we just pass a default constructed QRectF object to start from the page origin.

The rectangle we obtain from the Page::search() function can be used when we render the page to highlight the located text and scroll the view to make sure it is visible. However, the position and dimensions of the rectangle are given in points (1 inch = 72 points), so we need to transform the rectangle to cover the correct area on-screen.
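The conversion from points to pixels described above is a plain scaling by resolution / 72. A minimal sketch, using a stand-in RectF struct instead of QRectF so that it compiles without Qt (pointsToPixels() is a hypothetical helper name):

```cpp
#include <cassert>

// Stand-in for QRectF: top-left corner plus size.
struct RectF {
    double x, y, width, height;
};

// Map a rectangle given in points (1 inch = 72 points) to pixel
// coordinates at the given resolution and scale factor.
RectF pointsToPixels(const RectF &r, double dpiX, double dpiY,
                     double scaleFactor)
{
    assert(dpiX > 0 && dpiY > 0 && scaleFactor > 0);
    const double sx = scaleFactor * dpiX / 72.0;
    const double sy = scaleFactor * dpiY / 72.0;
    return RectF{r.x * sx, r.y * sy, r.width * sx, r.height * sy};
}
```

For instance, at 144 dpi with a scale factor of 1.5, every coordinate is multiplied by 3, so a rectangle one inch from the left edge (72 points) starts 216 pixels in.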

Searching through a document for a piece of text is slightly more involved than just a single function call. We'll look at this in more detail later.

Extracting Text

Since the mapping between the author's original text and its location on-screen may be purely visual, it is difficult to automate the extraction of text from PDF files, though there are tools that try very hard to achieve this.

Many document viewers let the user select and export text by making them select a region on-screen, giving the application something to work with, and Poppler supports this approach by providing a function that returns a string for a given rectangle that we call like this:

    QString text = doc->page(currentPage)->text(selectedRect);

The method we use is somewhat different to this. We'll cover it in more detail later.

The Example in Detail

Having covered the basics of displaying pages, searching, and extracting text from documents, let's take a closer look at how our example uses these features.

We provide two functions to search for text strings supplied by the user via the user interface. For forwards searching, we start by looking for strings on the current page, beginning at the current search location, then try each following page until the end of the document.

    QRectF DocumentWidget::searchForwards(const QString &text)
    {
        int page = currentPage;
        while (page < doc->numPages()) {
 
            if (doc->page(page)->search(text, searchLocation,
                Poppler::Page::NextResult,
                Poppler::Page::CaseInsensitive)) {
 
                if (!searchLocation.isNull()) {
                    showPage(page + 1);
                    return searchLocation;
                }
            }
            page += 1;
            searchLocation = QRectF();
        }

If we reach the end of the document without finding anything, we search from the beginning until we reach the current page.

        page = 0;
 
        while (page < currentPage) {
 
            searchLocation = QRectF();
 
            if (doc->page(page)->search(text, searchLocation,
                Poppler::Page::NextResult,
                Poppler::Page::CaseInsensitive)) {
 
                if (!searchLocation.isNull()) {
                    showPage(page + 1);
                    return searchLocation;
                }
            }
            page += 1;
        }
 
        return QRectF();
    }
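The wrap-around order used by searchForwards() can be tried out without Poppler. The sketch below is a simplified model, not the example's code: each page is reduced to a plain string, a search location is a character offset rather than a rectangle, matching is case-sensitive for brevity, and the two loops are folded into one by using modulo arithmetic. Match and its fields are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Result of a search: page index and character offset of the match.
struct Match {
    bool found;
    std::size_t page;
    std::size_t offset;
};

// Search starts on currentPage at startOffset, continues to the last
// page, then wraps to the first page and stops just before currentPage.
Match searchForwards(const std::vector<std::string> &pages,
                     const std::string &text,
                     std::size_t currentPage,
                     std::size_t startOffset)
{
    assert(currentPage < pages.size());
    const std::size_t count = pages.size();
    for (std::size_t step = 0; step < count; ++step) {
        const std::size_t page = (currentPage + step) % count;
        // Only the first page visited resumes from the previous location;
        // every other page is searched from its beginning.
        const std::size_t from = (step == 0) ? startOffset : 0;
        const std::size_t pos = pages[page].find(text, from);
        if (pos != std::string::npos)
            return Match{true, page, pos};
    }
    return Match{false, 0, 0};
}
```

As in the function above, the part of the current page that lies before the starting location is not revisited after wrapping around.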

As well as rendering pages at different scales, as shown earlier, we would like to highlight the results of searches. To do this, we insert some code to paint on the image obtained from the current page, using a matrix to map the rectangle onto the image.

    QMatrix DocumentWidget::matrix() const
    {
        return QMatrix(scaleFactor * physicalDpiX() / 72.0, 0,
                       0, scaleFactor * physicalDpiY() / 72.0,
                       0, 0);
    }
 
    void DocumentWidget::showPage(int page)
    {
        ...
 
        QImage image = doc->page(currentPage)->renderToImage(
                       scaleFactor * physicalDpiX(),
                       scaleFactor * physicalDpiY());
 
        if (!searchLocation.isEmpty()) {
            QRect highlightRect = matrix().mapRect(
                                  searchLocation).toRect();
            highlightRect.adjust(-2, -2, 2, 2);
            QImage highlight = image.copy(highlightRect);
            QPainter painter;
            painter.begin(&image);
            painter.fillRect(image.rect(),
                             QColor(0, 0, 0, 32));
            painter.drawImage(highlightRect, highlight);
            painter.end();
        }
 
        setPixmap(QPixmap::fromImage(image));
    }

The result of this additional effort is that the located text is displayed normally while the rest of the page is slightly darker.

In our example, we allow the user to draw a selection onto the page by reimplementing three of the mouse event handler functions in our DocumentWidget. In these we maintain a QRubberBand object to keep track of the area selected, following the pattern shown in the QRubberBand documentation.

The mouse release event handler is where we start the process of selecting text:

    void DocumentWidget::mouseReleaseEvent(QMouseEvent *)
    {
        ...
        if (!rubberBand->size().isEmpty()) {
            QRectF rect = QRectF(rubberBand->pos(),
                                 rubberBand->size());
            rect.moveLeft(rect.left() -
                 (width() - pixmap()->width()) / 2.0);
            rect.moveTop(rect.top() -
                 (height() - pixmap()->height()) / 2.0);
            selectedText(rect);
        }
 
        rubberBand->hide();
    }

When the user releases the mouse button, we create a rectangle with coordinates relative to the top-left corner of the image within the label, and we pass this to the selectedText() function which is responsible for informing the rest of the application about any text it finds.
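The offset arithmetic in the handler can be isolated into a small helper. This is a sketch under the assumption, implied by the halved differences above, that the image is centered inside the label; RectF is a stand-in for QRectF and widgetToImage() is a hypothetical name.

```cpp
#include <cassert>

// Stand-in for QRectF: top-left corner plus size, in pixels.
struct RectF {
    double x, y, width, height;
};

// Translate a rectangle from label (widget) coordinates to coordinates
// relative to the top-left corner of an image centered inside that label.
RectF widgetToImage(const RectF &r,
                    double widgetWidth, double widgetHeight,
                    double imageWidth, double imageHeight)
{
    assert(r.width >= 0 && r.height >= 0);
    const double marginX = (widgetWidth - imageWidth) / 2.0;
    const double marginY = (widgetHeight - imageHeight) / 2.0;
    return RectF{r.x - marginX, r.y - marginY, r.width, r.height};
}
```

With an 800 by 600 label and a 600 by 400 image, the margins are 100 pixels on each side, so a selection starting at (150, 120) in the label starts at (50, 20) in the image. Only the position changes; the size of the rectangle is untouched.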

As noted earlier, the Poppler Page class provides a function to return text within a rectangle in a document. However, in selectedText(), we use a more convoluted method to show how much information we can obtain about a document.

We begin by mapping the selection rectangle onto the page, using the inverse of the matrix we used to highlight search results, before obtaining a list of TextBox objects, each of which describes a piece of text on the page.

    void DocumentWidget::selectedText(const QRectF &rect)
    {
        QRectF selectedRect = matrix().inverted()
                                      .mapRect(rect);
 
        QString text;
        bool hadSpace = false;
        QPointF center;
        foreach (Poppler::TextBox *box,
                 doc->page(currentPage)->textList()) {
 
            if (selectedRect.intersects(box->boundingBox())) {
                if (hadSpace)
                    text += " ";
                if (!text.isEmpty() &&
                    box->boundingBox().top() > center.y())
                    text += "\n";
 
                text += box->text();
                hadSpace = box->hasSpaceAfter();
                center = box->boundingBox().center();
            }
        }
 
        if (!text.isEmpty())
            emit textSelected(text);
    }

We test whether each piece of text intersects the selection and append it to a QString if it does. We also perform some elementary checks to decide where to insert spaces and newline characters: a newline is added whenever a piece of text starts below the center of the previous one.
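The assembly logic can be exercised on its own with made-up data. The sketch below mirrors the loop in selectedText() but uses stand-in types: RectF and TextBox here are simplified structs written for this illustration, not the Qt or Poppler classes, and intersects() is a hand-written overlap test.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for QRectF: top-left corner plus size.
struct RectF {
    double x, y, width, height;
};

// Stand-in for Poppler::TextBox: a piece of text and where it sits.
struct TextBox {
    std::string text;
    RectF box;
    bool hasSpaceAfter;
};

// True if the two rectangles overlap.
bool intersects(const RectF &a, const RectF &b)
{
    return a.x < b.x + b.width && b.x < a.x + a.width &&
           a.y < b.y + b.height && b.y < a.y + a.height;
}

// Join the text of every box touching the selection, adding a space
// where the previous box asked for one and a newline when a box starts
// below the vertical center of the previous one.
std::string selectedText(const std::vector<TextBox> &boxes,
                         const RectF &selection)
{
    assert(selection.width >= 0 && selection.height >= 0);
    std::string text;
    bool hadSpace = false;
    double previousCenterY = 0.0;
    for (const TextBox &b : boxes) {
        if (!intersects(selection, b.box))
            continue;
        if (hadSpace)
            text += ' ';
        if (!text.empty() && b.box.y > previousCenterY)
            text += '\n';
        text += b.text;
        hadSpace = b.hasSpaceAfter;
        previousCenterY = b.box.y + b.box.height / 2.0;
    }
    return text;
}
```

Like the original loop, this emits a space followed by a newline when the last box on a line reports a trailing space; a more careful version would drop the space in that case.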

Note that, while we're satisfied with obtaining whole pieces of text (typically words in a sentence), recent versions of Poppler allow the individual characters in TextBox objects to be located.

In the user interface, when the user selects some text, we display it in a text browser so that it can be copied and pasted elsewhere.

Building the Example

The example is provided as a standard Qt project with a simple pdfviewer.pro file. Because there is a certain amount of freedom associated with where you can install the Poppler library and header files on your system, you will need to modify this file to use the correct paths.

On Ubuntu 8.04 with the libpoppler-qt4-dev package installed, the appropriate paths are as follows:

    INCLUDEPATH  += /usr/include/poppler/qt4
    LIBS         += -L/usr/lib -lpoppler-qt4

Other Linux distributions may install these files in different locations, and developers on other platforms may find it easier to build the library alongside the example instead of installing it.

Other Features and Improvements

Our PDF viewer example only uses the most basic features of the Poppler library. Since many documents use features like encryption, slideshow transitions, tables of contents and annotations, the viewer applications that use Poppler to render documents rely on the library's support for these features.

Poppler includes a number of low level features that are useful for the purpose of analysing PDF files. Access to the list of fonts used in a document and the font data itself can be useful when preparing documents for publication.

Access to the body of text in a document is useful to developers looking to index documents for text mining and subsequent analysis. However, as noted earlier, this might be of limited use for some documents. A good summary of the issues surrounding text extraction can be found on the following page:

http://www.glyphandcog.com/textext.html

Information that is not part of the visible document is also available via the Poppler API. Annotations, scripts (typically written in JavaScript) and the URLs for hyperlinks can all be obtained, though it is up to the application developer to present this information in a meaningful way.

Like Qt's QPrinter class, Poppler is also able to write PostScript files, so we could easily add support for file export and conversion. Recent versions also support PDF output, and this opens the door to the use of the library for PDF manipulation. In fact, since the library allows us to examine documents without having to display pages, it is possible to write command line tools to handle documents, and a number of these are supplied with Poppler.