
PDF2CBZ for better legibility on an eReader

I admit it: my ebook reader has many problems loading certain types of PDFs. I have tried various tricks to work around this, as my site can testify! In the last few days I have tried a new approach: conversion to CBZ, an archive format designed for the sequential viewing of images, especially comic books. And so far, it works!

Last week was terrible for my eyes: I wasn't able to read any issue of my newspaper, Il Fatto Quotidiano, on my Asus DR900 eReader. I tested both my script's output and the two original versions available online. No luck! The device would freeze while loading the file. Even my netbook became slow, above all when opening these PDFs.

The situation was unbearable, so I picked up my set_uniform_pagination.sh script again and tried to fix this problem for good.

The "definitive" solution

My initial idea was a double conversion:

  • uniformed PDF --> images
  • images --> PDF

So I tested some solutions found on the internet, using both ImageMagick and Ghostscript (the latter is the best for quality and low hardware requirements). But after the second conversion, the resulting PDF had some problems with text quality, as you can see in the pictures below.

  • Original version: the original PDF, size 45.5 MB
  • Conversion JPGs to PDF: PDF from JPG images, size 11.6 MB
  • Conversion PNGs to PDF: PDF from PNG images, size 18.1 MB (the output seems good, but on the reader the text is partially hidden)


I kept the best commands (remove the "#" character to test them) in this section of the script:

#######################################################################
# FIRST SOLUTION (PROBLEM WITH TEXT AND OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#
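For example, the two stages looked like this (just a sketch: file names are hypothetical, and the density and quality values are only the commented variants from the full script below):

# Stage 1: PDF to images
convert -density 200x200 input_uniform.pdf -units PixelsPerInch pag_%02d.png
# Stage 2: images back to PDF
convert pag_*.png -quality 90 input_convert.pdf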

If you have any suggestions, please tell me!

I didn't like any of the PDF outputs, but I did get good images, with one problem: my eReader doesn't read them! How to use them, then? It's simple: as a comic book archive, i.e. a series of image files, typically PNG (lossless compression) or JPEG (lossy compression), stored in a single archive file.
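In other words, a CBZ is just a ZIP archive with a different file extension; for example (with a hypothetical file name):

unzip -l MyComic.cbz   # lists the page images stored inside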

So I worked out the second solution:

  1. extract the images from the PDF (converting them to grey scale; the script also links to the other available output devices);
  2. archive them as a ZIP file;
  3. rename the ZIP file to CBZ (as sketched below).
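A minimal sketch of the three steps (file names are hypothetical; the script below automates the whole process):

gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pnggray -r300x300 -sOutputFile=pag_%04d.png input.pdf
zip input.zip pag_*.png
mv input.zip input.cbz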

It isn't a PDF anymore, but it works very well, with no more exhausting load times or slow page turns.

  • Conversion PNGs to CBZ: CBZ from PNG images, size 21.7 MB (the final output is in grey scale, because my eReader only displays black and white)

Here is the pdf2cbz_with_uniform_pagination.sh script; the bash code follows:

#!/bin/bash
# Script written by Nicola Rainiero
# Available at http://rainnic.altervista.org
#
# This work is licensed under the Creative Commons Attribution 3.0 Italy License.
# To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/it/
#
# Requirements: pdfinfo, awk, LaTeX with the pdfpages and ifthen packages, ImageMagick, Ghostscript, zip
# Usage: pdf2cbz_with_uniform_pagination.sh INPUT_FILE.pdf
#
# If you don't have the permission to execute it, run in a terminal:
# chmod +x pdf2cbz_with_uniform_pagination.sh
#

if [ -n "$1" ]
then
	document="$1" # check that an input PDF file was given
else
	echo "Missing input PDF file!"
	exit 1
fi

echo "$document"
# read the exact number of pages in the PDF file and store it in the "pagine" variable
echo `pdfinfo "$document" | awk ' $1=="Pages:" {print $2}'` > input.txt 
pagine=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')

echo "The pages of this document are $pagine"
echo '% File di conversione' > latex.tex
# initialize the latex document: the default page layout is "portrait";
# change "portrait" to "landscape" below to get the whole document in landscape
echo '\documentclass[a4paper,portrait]{minimal}' >> latex.tex;
echo '\usepackage[pdftex,portrait]{geometry}' >> latex.tex;
echo '\usepackage{pdfpages}' >> latex.tex;
echo '\usepackage{ifthen}' >> latex.tex;
echo '\newcounter{pg}' >> latex.tex;
echo '\begin{document}' >> latex.tex;

# read the horizontal dimension of the first page ("-f 1" option) and save it in: rifh
echo `pdfinfo -f 1 -box "$document" | awk ' $1=="MediaBox:" {print $4}'` > input.txt
readH=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')
echo "The width of the first page is $readH pt"
rifh=$( echo "($readH+0.5)/1" | bc )
echo "(Round it to $rifh)"
# read the vertical dimension of the first page ("-f 1" option) and save it in: rifv
echo `pdfinfo -f 1 -box "$document" | awk ' $1=="MediaBox:" {print $5}'` > input.txt
readV=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')
echo "The height of the first page is $readV pt"
rifv=$( echo "($readV+0.5)/1" | bc )
echo "(Round it to $rifv)"
echo "----------------"

# check for every page the corresponding horizontal dimension
# and compare it with the "rifh" variable
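# pages wider than the reference width plus a 200 pt tolerance are assumed to be
# double-page spreads and are split in two (see the viewport calls below)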
for i in `seq 1 $pagine`
do
   #
   # echo `pdfinfo -f $i -box $document | awk ' $1=="MediaBox:" {print $4}'` > input.txt
   # removed because it gives the following error:
   # "Command Line Error: Wrong page range given: the first page ("selected page") can not
   # be after the last page (previous page)"
   #
   # The new command works better:
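   # (it prints the boxes of all pages at once; awk then picks the line whose second
   # field matches the current page number and prints its 6th field, the width in pt)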
   echo `pdfinfo -l $pagine -box "$document" | awk '$2=='"$i"' && $3=="MediaBox:" {print $6}'` > input.txt
   h=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')
    echo "For the page $i the width is $h pt"
    h=$( echo "($h+0.5)/1" | bc )
    echo "(Round it to $h)"
    if [[ "$h" -gt "$rifh+200" ]]
	then
		echo 'split this' page
		echo '   \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex;
		echo '   \includepdf[pages='$i',viewport='$rifh' 0 '$h' '$rifv']{'$document'} ' >> latex.tex;
	else
		echo 'do' not 'split this' page
		echo '   \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex;
	fi
done

# close the latex document and make pdf --> latex.pdf
echo '\end{document} ' >> latex.tex;
pdflatex latex.tex 

# save in "nomefile" variable the exact name of the input file
nomefile=${1%%.*}
echo $nomefile

# rename the output pdf file
mv latex.pdf "$nomefile"_uniform.pdf

# clean up the temporary LaTeX files
rm input.txt
rm latex*

#######################################################################
# FIRST SOLUTION (PROBLEM WITH TEXT AND OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#

# Using ImageMagick JPEG/PNG/TIFF
# compression level and quality:
# http://www.imagemagick.org/script/command-line-options.php#quality
#

# Stage 1: PDF to images
#
## Poor quality for vector graphics
#convert -density 288 "$nomefile"_uniform.pdf -resize 25% pag_%02d.png
## Good compromise between quality and file size
#convert -density 200x200 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.png
## Good compromise between quality and file size, but heavy RAM usage during conversion
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.tiff
## Small files, but poor quality:
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.jpg

# Stage 2: images to PDF
#
#convert *.png "$nomefile"_convert.pdf
#convert *.png -quality 90 -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.tiff -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.jpg -quality 75 -set units PixelsPerInch "$nomefile"_convert.pdf

## optimize "$nomefile"_convert.pdf and rename it to "nomefile" plus the "_ebook" label
#gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$nomefile"_ebook.pdf "$nomefile"_convert.pdf

## clean up the leftover files
#rm "$nomefile"_convert.pdf
#rm "$nomefile"_uniform.pdf
#rm pag_*

#######################################################################
# SECOND AND "DEFINITIVE" SOLUTION
# convert all the PDF pages to images and pack them into a CBZ archive
#

##Using Ghostscript (BEST RESULTS)
# list of output devices available:
# http://pages.cs.wisc.edu/~ghost/doc/AFPL/devices.htm
#
## Good quality for raster graphics (jpeggray --> greyscale JPEG; use "jpeg" for RGB output):
#gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=jpeggray -r300x300 -dJPEGQ=100 -sOutputFile=pag_%04d.jpg "$nomefile"_uniform.pdf
## Best quality for vector graphics (pnggray --> greyscale PNG; use "png16m" for 24-bit colour output):
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pnggray -r300x300 -sOutputFile=pag_%04d.png "$nomefile"_uniform.pdf

## zip the images, rename the archive to .cbz and clean up
# (the %04d padding above keeps the pages in the right order even beyond 99 pages;
# piping zip's output through sort only sorted its listing, not the archive itself)
zip "$nomefile" pag_*
mv "$nomefile".zip "$nomefile".cbz
rm pag_*

## clean up the leftover files
rm "$nomefile"_uniform.pdf

# EXIT
exit 0
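
To run it, make the script executable and pass it a PDF (the file name here is only an example):

chmod +x pdf2cbz_with_uniform_pagination.sh
./pdf2cbz_with_uniform_pagination.sh IlFattoQuotidiano.pdf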



Nicola Rainiero

A civil geotechnical engineer with the ambition of making his own work easier through free software, in the spirit of knowledge and collective sharing. I also deal with green energy, in particular shallow geothermal energy, and I have always been involved in web design and 3D modelling.