PDF2CBZ for better legibility on an eReader

The "definitive" solution

My initial idea was a double conversion:

uniform PDF --> images
images --> PDF

So I tested some solutions found on the internet, using both ImageMagick and Ghostscript (the latter is the best for quality and low hardware requirements). After the second conversion, however, the resulting PDF had some problems with text quality, as you can see in the pictures below.


Original PDF
Size: 45.5 MB

PDF from JPG images
Size: 11.6 MB

PDF from PNG images
Size: 18.1 MB (the output seems good, but in the reader the text is partially hidden)

I put the best commands in this section of the script (remove the "#" character to test them):

#######################################################################
# FIRST SOLUTION (POOR OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#

If you have any suggestions, please tell me!

I didn't like any of the PDF outputs, but I did have good images, with one problem: my eReader can't open them as they are. How could I use them? The answer is simple: as a comic book archive, i.e. a series of image files, typically PNG (lossless compression) or JPEG (lossy compression), stored as a single archive file.

So I worked out the second solution:

extract the images from the PDF (in grey scale, but the script includes a link to the other available outputs);
archive them as a ZIP file;
rename the ZIP file to CBZ.
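
As a minimal sketch, the three steps above could be wrapped in a single function. The names run, pdf_to_cbz and DRY_RUN are hypothetical and are not part of the script below; the Ghostscript options are the ones the script uses, and the output name assumes an input file ending in ".pdf".

```shell
#!/bin/bash
# Hypothetical helper: run a command, or only print it when DRY_RUN is set.
run() {
	if [ -n "$DRY_RUN" ]; then
		echo "$*"
	else
		"$@"
	fi
}

# Hypothetical function covering the three steps for one PDF file.
pdf_to_cbz() {
	local pdf=$1
	local name=${pdf%.pdf}   # assumption: the input name ends in ".pdf"
	# 1. render every page as a grey scale PNG at 300 dpi
	run gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pnggray -r300x300 \
		-sOutputFile=pag_%03d.png "$pdf" || return 1
	# 2. store the page images in a ZIP archive
	run zip "$name.zip" pag_*.png || return 1
	# 3. rename the archive so that readers treat it as a comic book
	run mv "$name.zip" "$name.cbz"
}
```

Calling "DRY_RUN=1 pdf_to_cbz book.pdf" only prints the three commands, which is a cheap way to check them before running the real conversion with "pdf_to_cbz book.pdf".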

It is no longer a PDF, but it works very well, without exhausting loading times or slow page changes.

PNG to CBZ conversion
Size: 21.7 MB
(the final output is in grey scale, because my eReader has a black and white display)
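
If your eReader has a colour display, only the Ghostscript output device needs to change. The following is a small sketch of that choice; the function gs_device and its mode names are hypothetical, while the four device names are standard Ghostscript devices.

```shell
#!/bin/bash
# Hypothetical helper: map an output mode to a Ghostscript device name.
gs_device() {
	case $1 in
		gray)       echo pnggray ;;   # 8-bit grey scale PNG
		color)      echo png16m ;;    # 24-bit colour PNG
		gray-jpeg)  echo jpeggray ;;  # grey scale JPEG
		color-jpeg) echo jpeg ;;      # colour JPEG
		*)          echo "unknown mode: $1" >&2; return 1 ;;
	esac
}

device=$(gs_device gray) || exit 1
echo "$device"   # prints "pnggray"
```

The result can then be passed to gs as -sDEVICE="$device" in place of the fixed device name.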

Here is the pdf2cbz_with_uniform_pagination.sh script, with the bash code below:

#!/bin/bash
# Script written by Nicola Rainiero
# Available at http://rainnic.altervista.org
#
# This work is licensed under the Creative Commons Attribution 3.0 Italy License.
# To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/it/
#
# Requirements: pdfinfo, awk, bc, LaTeX with the pdfpages and ifthen packages, ImageMagick, Ghostscript, zip
# Usage: pdf2cbz_with_uniform-pagination.sh INPUT_FILE.pdf
#
# If you don't have the permission to execute it, run in a terminal:
# chmod +x pdf2cbz_with_uniform-pagination.sh
#

# check that an input PDF file was given
if [ -n "$1" ]
then
	document=$1
else
	echo 'Missing input PDF file!' >&2
	exit 1
fi

echo "$document"
# read the number of pages in the PDF file and store it in the "pagine" variable
pagine=$(pdfinfo "$document" | awk '$1=="Pages:" {print $2}')

echo "This document has $pagine pages"
echo '% Conversion file' > latex.tex
# initialize the LaTeX document: the default page layout is "portrait";
# replace "portrait" with "landscape" below to change the layout of the whole document
echo '\documentclass[a4paper,portrait]{minimal}' >> latex.tex;
echo '\usepackage[pdftex,portrait]{geometry}' >> latex.tex;
echo '\usepackage{pdfpages}' >> latex.tex;
echo '\usepackage{ifthen}' >> latex.tex;
echo '\newcounter{pg}' >> latex.tex;
echo '\begin{document}' >> latex.tex;

# read the horizontal dimension of the first page ("-f 1" option) and save it, rounded, in "rifh"
readH=$(pdfinfo -f 1 -box "$document" | awk '$1=="MediaBox:" {print $4}')
echo "The width of the first page is $readH pt"
rifh=$( echo "($readH+0.5)/1" | bc )
echo "(Rounded to $rifh)"
# read the vertical dimension of the first page ("-f 1" option) and save it, rounded, in "rifv"
readV=$(pdfinfo -f 1 -box "$document" | awk '$1=="MediaBox:" {print $5}')
echo "The height of the first page is $readV pt"
rifv=$( echo "($readV+0.5)/1" | bc )
echo "(Rounded to $rifv)"
echo "----------------"

# check for every page the corresponding horizontal dimension
# and compare it with the "rifh" variable
for i in `seq 1 $pagine`
do
   #
   # echo `pdfinfo -f $i -box $document | awk ' $1=="MediaBox:" {print $4}'` > input.txt
   # removed because it gives the following error:
   # "Command Line Error: Wrong page range given: the first page ("selected page") can not
   # be after the last page (previous page)"
   #
   # The new command works better:
   echo `pdfinfo -l $pagine -box "$document" | awk '$2=='"$i"' && $3=="MediaBox:" {print $6}'` > input.txt
   h=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')
   echo "For page $i the width is $h pt"
   h=$( echo "($h+0.5)/1" | bc )
   echo "(Rounded to $h)"
   # a page more than 200 pt wider than the first one is treated as a double page and split in two
   if [[ "$h" -gt "$rifh+200" ]]
   then
      echo 'split this page'
      echo '   \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex;
      echo '   \includepdf[pages='$i',viewport='$rifh' 0 '$h' '$rifv']{'$document'} ' >> latex.tex;
   else
      echo 'do not split this page'
      echo '   \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex;
   fi
done

# close the latex document and make pdf --> latex.pdf
echo '\end{document} ' >> latex.tex;
pdflatex latex.tex 

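# save in the "nomefile" variable the name of the input file up to its first dot
# (so the file name and its path should contain no dot other than the one before "pdf")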
nomefile=${1%%.*}
echo $nomefile

# rename the output pdf file
mv latex.pdf "$nomefile"_uniform.pdf

# clean up the temporary files left by LaTeX
rm input.txt
rm latex*

#######################################################################
# FIRST SOLUTION (PROBLEM WITH TEXT AND OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#

# Using Imagemagick JPEG/PNG/TIFF
# compression level and quality:
# http://www.imagemagick.org/script/command-line-options.php#quality
#

# Stage 1: PDF to images
#
## Poor quality for vector graphics
#convert *.png "$nomefile"_convert.pdf
#convert -density 288 "$nomefile"_uniform.pdf -resize 25% pag_%02d.png
## Good compromise quality/weight of the images
#convert -density 200x200 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.png
## Good compromise quality/weight of the images, but heavy RAM usage during conversion
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.tiff
## Small files, but poor quality:
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.jpg

# Stage 2: images to PDF
#
#convert *.png -quality 90 -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.tiff -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.jpg -quality 75 -set units PixelsPerInch "$nomefile"_convert.pdf

## optimize "$nomefile"_convert.pdf and save it under the "nomefile" name plus the "_ebook" label
#gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$nomefile"_ebook.pdf "$nomefile"_convert.pdf

## clean up the leftover files
#rm "$nomefile"_convert.pdf
#rm "$nomefile"_uniform.pdf
#rm pag_*

#######################################################################
# SECOND AND "DEFINITIVE" SOLUTION
# convert all the PDF pages to images and pack them into a CBZ archive
#

## Using Ghostscript (BEST RESULTS)
# list of output devices available:
# http://pages.cs.wisc.edu/~ghost/doc/AFPL/devices.htm
#
## Good quality for raster graphics (jpeg --> JPEG format with RGB output, jpeggray --> grey scale):
#gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=jpeggray -r300x300 -dJPEGQ=100 -sOutputFile=pag_%04d.jpg "$nomefile"_uniform.pdf
## Best quality for vector graphics (png16m --> PNG format with 24-bit color output, pnggray --> grey scale):
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pnggray -r300x300 -sOutputFile=pag_%04d.png "$nomefile"_uniform.pdf

## zip, rename to cbz and clean images
# the page numbers are zero-padded to four digits so that pag_* expands in page order
# even beyond 99 pages (piping the output of zip through sort only reorders its log)
zip "$nomefile".zip pag_*
mv  "$nomefile".zip "$nomefile".cbz
rm pag_*

## clean up the leftover files
rm "$nomefile"_uniform.pdf

# EXIT
exit 0
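
A note on page order: the order of the pages in the archive depends on how the image names sort. The sketch below, with a hypothetical helper called sort_pages and made-up file names, shows the difference between the plain order of names such as pag_NN.png and a numeric sort on the part after the underscore.

```shell
#!/bin/bash
# Hypothetical helper: sort page file names by the number after "pag_".
sort_pages() {
	sort -t _ -k 2 -n
}

# Lexical order, as a glob such as pag_* lists names with two-digit padding:
# page 100 comes before page 11.
printf '%s\n' pag_01.png pag_02.png pag_100.png pag_11.png

echo "---"

# Numeric order on the field after the underscore: 1, 2, 11, 100.
printf '%s\n' pag_01.png pag_02.png pag_100.png pag_11.png | sort_pages

# To archive in that order, the sorted names can be given to zip on its
# standard input with the -@ option, for example:
#   printf '%s\n' pag_* | sort_pages | zip archive.zip -@
```

Padding the page number with enough zeros in the Ghostscript output name avoids the problem at the source, because lexical and numeric order then coincide.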
