
Searching the text of a G-Book®

G-Books start out as images (scans or camera shots) of individual book or magazine pages, assembled into large PDF files. Each page is essentially a photograph, not letters and words stored as text in a computer file, as in a text-based ebook (such as a plain-text PDF or a typical .epub file). G-Books can nevertheless be made text-searchable by adding the text as an invisible layer behind each page's graphical image. You see the image as usual, but the PDF software can search the text behind the image and highlight the place on the image where the matching text lies. G-Books that include this feature are listed in the store as Search Text: Yes (searchable text layer behind page images). As the technology improves we are producing more searchable-text books. We will never remove the top image layer that gives the G-Book its "real" book quality, but where file sizes do not grow dramatically we will be adding more searchable books.

There is an important caveat (which applies to many text-based ebooks as well): we don't have the resources to proofread the results of optical character recognition (OCR), so the hidden text behind the images will only be as accurate as the raw output of the OCR engine. You might search for a word and find it in nine out of ten places where it occurs in the book, while the tenth is missed because the OCR software misread it. Still, finding 90% of the occurrences can be helpful if you're searching for a specific term, so we think it's worth adding this background text layer when it's feasible.
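For readers who like to see the idea in code, here is a minimal Python sketch of a hidden text layer. It is only a toy model under our own assumptions, not how any particular PDF reader or OCR engine works: the `Word` and `Page` classes, the file names, and the box coordinates are all made up for illustration. Each page keeps its image plus a list of OCR words with their positions; a search looks only at the hidden words and returns the rectangles to highlight on the image. Page 2 contains a deliberate OCR misreading ("wha1e") to show how one occurrence can be missed.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str   # what the OCR engine recognized (may be wrong)
    box: tuple  # (x0, y0, x1, y1) position of the word on the page image


@dataclass
class Page:
    number: int
    image: str   # stand-in for the scanned page image (hypothetical file name)
    words: list  # the hidden text layer: OCR words with their positions


def search(pages, term):
    """Return (page number, box) for every hidden word that matches term."""
    term = term.lower()
    hits = []
    for page in pages:
        for word in page.words:
            # Compare case-insensitively and ignore surrounding punctuation.
            if word.text.lower().strip('.,;:!?"\'') == term:
                hits.append((page.number, word.box))
    return hits


pages = [
    Page(1, "page-001.jpg", [Word("The", (10, 10, 40, 22)),
                             Word("whale", (45, 10, 95, 22))]),
    # The printed page says "whale", but the OCR engine misread the letter l as 1.
    Page(2, "page-002.jpg", [Word("A", (10, 10, 20, 22)),
                             Word("wha1e", (25, 10, 75, 22))]),
    Page(3, "page-003.jpg", [Word("whale,", (30, 40, 85, 52))]),
]

hits = search(pages, "whale")
for page_number, box in hits:
    print(f"highlight page {page_number} at {box}")
```

The search finds the word on pages 1 and 3 and reports where to draw the highlight, while the misread word on page 2 goes unnoticed even though the page image shows it perfectly well.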

If you'd like a little more background: text-based ebooks, which always permit text searches by their nature, don't necessarily escape the problems of optical character recognition (OCR) technology. Besides denying you the original typefaces, layout, art, and other enjoyable aspects of the printed version of the book, many if not most public-domain text-based ebooks started out by being scanned from the pages of a printed book, just like our G-Books. Their producers also apply OCR to each page to convert the graphical images into text, but unlike us they discard the pictorial and graphical layer. So a text-based ebook from scanned sources is also only as accurate as the OCR engine and anyone who proofread and corrected the results (if there even was a proofreader!). In fairness, most text-based ebooks like those from Gutenberg.org are proofread by their volunteers, something we don't have the capacity to do. If searchable text is most important to you, a text-based ebook might be your best choice. But remember, a G-Book gives you a high-resolution snapshot of every page, so you're assured of every word being just as it was typeset, including the original typographical errors that identify some early editions and make them especially collectible.

We don't think you'll be losing much if your G-Book isn't searchable. Remember, those wonderful original printed books weren't text-searchable either, and they were still very functional. Any scholarly book that had an index (the original "text-searchable" technology) still has that index in the G-Book version. You can look in the index, find the relevant page numbers, and jump to those pages. And a good index provides a better search experience than simply hunting for each occurrence of a term in a book.

While text-based ebooks let you search for literal strings of text, they usually break any index the printed book had, because they remove the book's original page breaks and turn it into one long stream of text. You can look at the index at the back of an ebook and see that a term occurred on page 317 of the printed book, but where is page 317 in a text-based ebook that scrolls from top to bottom with no page breaks, or with page breaks artificially created by the size of your reader and the typeface you've chosen? So in a text-based ebook, instead of using the printed book's index to find the relevant pages for a term, you'll have to step through every match of that term, without the benefit of a professional indexer's careful selection and arrangement of salient terminology into main entries and subentries pointing to ranges of pages. Searchable-text G-Books give you the best of both: you can search for your own terms, knowing the OCR software isn't perfect, and you can look up carefully themed entries in the book's original index and jump to the intact page numbers it references. If your G-Book doesn't have a text layer (you'll see Search Text: No in its description), then you're in the same boat as if you held the printed copy in your hands: no better, but no worse!
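To make the idea of an index lookup concrete, here is a small Python sketch. The index contents and the `pages_for` helper are hypothetical, invented purely for illustration; a real index is of course just printed pages at the back of the book. The sketch models a main entry with subentries, each pointing to ranges of printed page numbers, and expands one subentry into the pages you would jump to.

```python
# Hypothetical back-of-book index: a main entry with subentries, each
# pointing to inclusive (first page, last page) ranges in the printed book.
index = {
    "whales": {
        "anatomy of": [(120, 126)],
        "hunting of": [(45, 52), (317, 317)],
    },
}


def pages_for(index, entry, subentry):
    """Expand the page ranges listed under one subentry into page numbers."""
    pages = []
    for first, last in index[entry][subentry]:
        pages.extend(range(first, last + 1))
    return pages


print(pages_for(index, "whales", "hunting of"))
```

Because a G-Book keeps the printed page breaks, every number the lookup produces is a page you can jump straight to. In a reflowed text-based ebook those same numbers no longer correspond to any fixed place in the text.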