Digitizing an Image Capture PDF of My Belief by Hermann Hesse
I recently picked up my copy of The Glass Bead Game by Hermann Hesse intending to give it another read. Instead I was immediately distracted by something Ziolkowski mentions in the foreword.
In several essays that he wrote around 1920—most notably in pieces on Nietzsche and Dostoevsky—Hesse argued that men must seek a new morality that, transcending the conventional dichotomy of good and evil, will embrace all extremes of life in one unified vision. A later essay, “A Bit of Theology” (1932), outlines the three-stage progression toward this goal. The child, he says, is born into a state of unity with all being. It is only when the child is taught about good and evil that he advances to a second level of individuation characterized by despair and alienation; for he has been made aware of laws and moral codes, but feels incapable of adhering to the arbitrary standards established by conventional religious or moral systems since they exclude so much of what seems perfectly natural.
That really inspired me to take a peek at those essays!
They can be found in a collection titled My Belief that was published in 1974. I was not entirely surprised to discover that the collection has been out of print for some time. I was surprised to discover that I could only find heavily used copies online going for over $50! To make matters worse the only digital copy I could find was an image capture PDF on The Internet Archive.
I do most of my reading on an e-reader and value a good quality EPUB that I can convert to a PDF. In that regard, an image capture PDF is probably the worst quality digital conversion possible.
As if I wasn't already far enough off track, I then decided to convert this amazing collection to a high quality EPUB. I love Hesse and these essays should be easily available to everyone in the quality they deserve.
I also don't have a job and have far too much time on my hands.
Pre-Processing the Images
The pages have already had some OCR done to them but I wanted to be able to easily automate as much of this process as possible.
Here is an example of one of the pages. The contrast must be adjusted if we want to get a usable output from Tesseract.
After importing them into Acrobat and re-exporting as JPEG I was able to grab some processed versions of each page.
Extracting the Text For Each Essay
I used Tesseract, an open source OCR engine, to grab the text from each page. I wrote a small script in Python to extract the text from a specified range of pages.
import subprocess

for i in range(page_start, page_end + 1):
    # Page numbers in the exported file names are zero-padded to three digits
    image_path = "./resources/image_exports/My belief _ essays on life and art - Hesse, Hermann, 1877-1962_Page_{:03d}_Image_0002.jpg".format(i)
    # "-" sends the recognized text to stdout instead of a file
    content = subprocess.run(
        ["tesseract", image_path, "-", "-l", "eng"],
        stdout=subprocess.PIPE,
    ).stdout.decode("utf-8")
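The file-name handling in that loop can be pulled into a small helper. This is only a sketch: `page_image_path` and `ocr_command` are hypothetical names, and it assumes every exported page number is padded to three digits, which holds for the page ranges used in this post but is not something I checked for every file.

```python
from pathlib import Path

# Export directory and file-name pattern, copied from the loop above.
EXPORT_DIR = Path("./resources/image_exports")
NAME_PATTERN = (
    "My belief _ essays on life and art - Hesse, Hermann, "
    "1877-1962_Page_{:03d}_Image_0002.jpg"
)


def page_image_path(page: int) -> Path:
    """Return the path of the exported image for one page (hypothetical helper)."""
    return EXPORT_DIR / NAME_PATTERN.format(page)


def ocr_command(page: int) -> list:
    """Build the tesseract call: image in, text to stdout ("-"), English model."""
    return ["tesseract", str(page_image_path(page)), "-", "-l", "eng"]


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(ocr_command(35))
```

The list returned by `ocr_command` can be passed straight to `subprocess.run`, so the padding logic lives in one place.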
I then needed to do some cleaning on the output as it is far from ideal. Here is an excerpt from the raw output of a page.
xii) : INTRODUCTION
man and foreign, appreciative essays on favorite writers,
and introductions to various anthologies as well as to edi-
tions of works from world literature. A second major cate-
gory includes essays of a more personal nature. These are
sometimes frankly confessional like the “Letter to a Philis-
tine.” And at times they amount to such statements of per-
sonal credo as “My Belief” or “A Bit of Theology.” Finally,
there remains a substantial group of essays that address
themselves to frankly political issues or, more generally, to
questions of cultural criticism.
...
There are a few issues.
- The first two lines of each page are useless captures of the page header.
- Words are split as they wrap to the next line in order to keep a consistent column width.
- Each line has a \n character. I want them to be blocks of text that only contain \n\t for paragraphs.
Here is the script I wrote up to clean these captures.
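import re

# Drop the first two lines, which hold the running header on most pages
content = '\n'.join(content.split('\n')[2:])
# Rejoin words that were hyphenated across a line break
content = content.replace("-\n", "")
# Replace a newline that follows any character with a space; in a blank-line
# paragraph break this consumes the first newline and leaves the second
content = re.sub(r"([^^])(\n)", r"\1 ", content)
# Turn each remaining newline (a paragraph break) into newline + tab
content = re.sub(r"\n", r"\n\t", content)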
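To see what these steps do to a page, here is a small worked example. It is a sketch rather than the exact script: `clean_page` is a hypothetical wrapper, the sample text is made up from the excerpt above, and steps 3 and 4 use lookarounds and `\n{2,}` instead of the character class in my script, so that longer runs of blank lines collapse to a single paragraph break.

```python
import re


def clean_page(raw: str) -> str:
    """Apply the four cleaning steps to the raw OCR text of one page."""
    # 1. Drop the two header lines at the top of the page.
    text = "\n".join(raw.split("\n")[2:])
    # 2. Rejoin words hyphenated across a line break.
    text = text.replace("-\n", "")
    # 3. Turn single line breaks (not part of a blank-line gap) into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # 4. Mark each run of two or more newlines as newline + tab.
    text = re.sub(r"\n{2,}", "\n\t", text)
    return text


# Made-up sample: a header line, a blank line, then two short paragraphs.
sample = (
    "xii) : INTRODUCTION\n"
    "\n"
    "and introductions to various anthologies as well as to edi-\n"
    "tions of works from world literature.\n"
    "\n"
    "Finally, there remains a substantial group of essays.\n"
)

if __name__ == "__main__":
    print(repr(clean_page(sample)))
```

Note that step 2 will also join compound words that are genuinely hyphenated at a line end, which is one more thing to catch while proofreading.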
I then had to manually define a list with one tuple per essay, holding its name and page range, in order to automate this entire process. This was pretty annoying, not gonna lie... there are over 70 essays.
essays = [
    ("letter_to_a_young_poet", 35, 38),
    ("old_music", 39, 43),
    ("letter_to_a_philistine", 44, 49),
    ("language", 50, 55),
    ("the_refuge", 56, 60),
    ...
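A list like this can drive the whole extraction. The sketch below shows one way to do it, assuming one text file per essay. `export_essays` and the `ocr_page` callback are hypothetical names, and the demo swaps in a fake OCR function so it runs without Tesseract or the page images.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable

# First entries from the list above; the rest is elided in the post.
essays = [
    ("letter_to_a_young_poet", 35, 38),
    ("old_music", 39, 43),
]


def export_essays(
    essays: Iterable[tuple],
    ocr_page: Callable[[int], str],
    out_dir: Path,
) -> list:
    """Write one text file per essay by concatenating its OCR'd pages."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, page_start, page_end in essays:
        # Page ranges are inclusive on both ends, as in the tuples above.
        pages = [ocr_page(i) for i in range(page_start, page_end + 1)]
        target = out_dir / f"{name}.txt"
        target.write_text("\n".join(pages), encoding="utf-8")
        written.append(target)
    return written


def fake_ocr(page: int) -> str:
    """Stand-in for the Tesseract call (assumption for the demo only)."""
    return f"[text of page {page}]"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_essays(essays, fake_ocr, Path(tmp)):
            print(path.name)
```

Passing the OCR step in as a function keeps the page-range bookkeeping separate from the Tesseract call, so the loop can be tried out on fake data first.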
Here is an example of a text file produced by my script!
As you can see, the drop cap at the beginning of each essay confuses Tesseract, and there are typos here and there where an artifact from the image is interpreted as a character.
Proofreading and Converting to EPUB
I then copied each essay into a Pages document and corrected any errors as I read each essay.
This took a while, but I will never complain about reading anything written by Hesse!
The Final Product
Here is a link to the EPUB.
It is truly a beautiful collection that I converted and uploaded with the deepest respect. As it has been out of print for some time there is nothing to lose by making this more accessible, and everything to gain!
Everyone should have the opportunity to read such an invaluable document and I just wanted to do what I could to make that happen.
I believe Hesse would be happy to see his essays in the hands of modern readers.
I highly recommend the following essays:
- On Reading Books
- About Dostoevsky
- On Little Joys