PDF2CBZ for better legibility on eReaders
I admit it: my ebook reader has many problems loading some types of PDF, although I have tried various tricks to fix this, as my site can testify! In recent days I have tried a new approach: conversion to CBZ, an archive format designed for the sequential viewing of images, especially comic books. And so far, it works!
Last week was terrible for my eyes: I was unable to read any issue of my newspaper, Il Fatto Quotidiano, on my Asus DR900 eReader. I tested both the output of my script and the two original versions available online. No luck! The device would freeze while loading the file. Even my netbook was slow, above all when opening these PDFs.
The situation was unbearable, so I picked up my set_uniform_pagination.sh script again and tried to fix this problem for good.
The "definitive" solution
My initial idea was a double conversion (a minimal sketch follows the list):
- uniformed PDF --> images
- images --> PDF
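Here is a minimal sketch of that round trip, using the same Ghostscript and ImageMagick calls that appear in the script further down (the file names are hypothetical):

# Stage 1: rasterize each page of the uniformed PDF to a numbered PNG with Ghostscript
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=png16m -r200 -sOutputFile=pag_%02d.png input_uniform.pdf
# Stage 2: rebuild a single PDF from the numbered images with ImageMagick
convert pag_*.png -quality 90 output_convert.pdf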
So I tested some solutions found on the internet, employing both ImageMagick and Ghostscript (the latter is the best for quality and low hardware requirements). But in the second conversion the resulting PDF had some problems with text quality, as you can see in the pictures below.
- Original PDF: 45.5 MB
- PDF from JPG images: 11.6 MB
- PDF from PNG images: 18.1 MB (the output seems good, but on the reader the text is partially hidden)
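The first solution also shrinks the rebuilt PDF with Ghostscript's pdfwrite device, the same command that is commented out in the script below (file names hypothetical):

# downsample and recompress the rebuilt PDF with the /ebook preset
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output_ebook.pdf output_convert.pdf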
I put the best commands in this section of the script (you can remove the "#" character to test them):
#######################################################################
# FIRST SOLUTION (POOR OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#
If you have any suggestions, please tell me!
I didn't like any of the PDF outputs, but I did get good images, with one problem: my eReader doesn't read them! How to use them? Simple: as a comic book archive, i.e. a series of image files, typically PNG (lossless compression) or JPEG (lossy compression), stored in a single archive file.
So I worked out the second solution (see the sketch after this list):
- extract the images from the PDF (converting them to greyscale; in the script I include a link to the other available output devices);
- archive them in a ZIP archive;
- rename the ZIP file to CBZ.
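A minimal sketch of the three steps, assuming the uniformed PDF is named input_uniform.pdf (the full script below automates all of this):

# 1. extract every page as a greyscale PNG with Ghostscript
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pnggray -r300x300 -sOutputFile=pag_%02d.png input_uniform.pdf
# 2. pack the numbered images into a ZIP archive
zip pages.zip pag_*
# 3. rename the ZIP to CBZ, so the reader treats it as a comic book archive
mv pages.zip pages.cbz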
It is no longer a PDF, but it works very well, without exhausting load times or sluggish page turns.
- PNG-to-CBZ conversion: 21.7 MB (the final output is in greyscale, because my eReader has a black-and-white screen)
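If your reader has a colour screen, a reasonable variation (my assumption, not something I have tested on the DR900) is to swap the greyscale device for the 24-bit colour one listed in the script:

# png16m produces 24-bit colour PNGs instead of greyscale ones
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=png16m -r300x300 -sOutputFile=pag_%02d.png input_uniform.pdf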
Here is the pdf2cbz_with_uniform_pagination.sh script; the bash code is below:
#!/bin/bash
# Script written by Nicola Rainiero
# Available at http://rainnic.altervista.org
#
# This work is licensed under the Creative Commons Attribution 3.0 Italy License.
# To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/it/
#
# Requirements: pdfinfo, awk, LaTeX with the pdfpages and ifthen packages, ImageMagick, Ghostscript, zip
# Usage: pdf2cbz_with_uniform_pagination.sh INPUT_FILE.pdf
#
# If you don't have the permission to execute it, run in a terminal:
# chmod +x pdf2cbz_with_uniform_pagination.sh
#

# check that an input PDF file was given
if [ -n "$1" ]
then
    document="$1"
else
    echo "Missing input PDF file!!"
    exit 1
fi
echo "$document"

# read the exact number of pages in the PDF file and store it in the "pagine" variable
echo `pdfinfo "$document" | awk '$1=="Pages:" {print $2}'` > input.txt
pagine=$(cat input.txt | awk '{ SUM += $1 } END { print SUM }')
echo "The pages of this document are $pagine"

# initialize the LaTeX document: the default page layout is "portrait";
# change both options to "landscape" to rotate the whole document
echo '% Conversion file' > latex.tex
echo '\documentclass[a4paper,portrait]{minimal}' >> latex.tex
echo '\usepackage[pdftex,portrait]{geometry}' >> latex.tex
echo '\usepackage{pdfpages}' >> latex.tex
echo '\usepackage{ifthen}' >> latex.tex
echo '\newcounter{pg}' >> latex.tex
echo '\begin{document}' >> latex.tex

# read the horizontal dimension of the first page ("-f 1" option) and save it in: rifh
echo `pdfinfo -f 1 -box "$document" | awk '$1=="MediaBox:" {print $4}'` > input.txt
readH=$(cat input.txt | awk '{ SUM += $1 } END { print SUM }')
echo "The width of the first page is $readH pt"
rifh=$( echo "($readH+0.5)/1" | bc )
echo "(Round it to $rifh)"

# read the vertical dimension of the first page ("-f 1" option) and save it in: rifv
echo `pdfinfo -f 1 -box "$document" | awk '$1=="MediaBox:" {print $5}'` > input.txt
readV=$(cat input.txt | awk '{ SUM += $1 } END { print SUM }')
echo "The height of the first page is $readV pt"
rifv=$( echo "($readV+0.5)/1" | bc )
echo "(Round it to $rifv)"
echo "----------------"

# check for every page the corresponding horizontal dimension
# and compare it with the "rifh" variable
for i in `seq 1 $pagine`
do
    # echo `pdfinfo -f $i -box $document | awk '$1=="MediaBox:" {print $4}'` > input.txt
    # removed because it gives the following error:
    # "Command Line Error: Wrong page range given: the first page ("selected page") can not
    # be after the last page (previous page)"
    #
    # The new command works better:
    echo `pdfinfo -l $pagine -box "$document" | awk '$2=='"$i"' && $3=="MediaBox:" {print $6}'` > input.txt
    h=$(cat input.txt | awk '{ SUM += $1 } END { print SUM }')
    echo "For the page $i the width is $h pt"
    h=$( echo "($h+0.5)/1" | bc )
    echo "(Round it to $h)"
    # pages noticeably wider than the reference are split in two halves
    if [[ "$h" -gt "$rifh+200" ]]
    then
        echo "split this page"
        echo ' \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex
        echo ' \includepdf[pages='$i',viewport='$rifh' 0 '$h' '$rifv']{'$document'} ' >> latex.tex
    else
        echo "do not split this page"
        echo ' \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex
    fi
done

# close the LaTeX document and build the PDF --> latex.pdf
echo '\end{document} ' >> latex.tex
pdflatex latex.tex

# save in the "nomefile" variable the name of the input file without its extension
nomefile=${1%.*}
echo "$nomefile"

# rename the output PDF file
mv latex.pdf "$nomefile"_uniform.pdf

# clean up the LaTeX working files
rm input.txt
rm latex*
#######################################################################
# FIRST SOLUTION (PROBLEM WITH TEXT AND OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#
# Using ImageMagick JPEG/PNG/TIFF
# compression level and quality:
# http://www.imagemagick.org/script/command-line-options.php#quality
#
# Stage 1: PDF to images
#
## Poor quality for vector graphics
#convert -density 288 "$nomefile"_uniform.pdf -resize 25% pag_%02d.png
## Good compromise between quality and weight of the images
#convert -density 200x200 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.png
## Good compromise between quality and weight, but massive use of RAM during conversion
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.tiff
## Only slim files, but poor quality:
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.jpg
#
# Stage 2: images to PDF
#
#convert *.png "$nomefile"_convert.pdf
#convert *.png -quality 90 -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.tiff -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.jpg -quality 75 -set units PixelsPerInch "$nomefile"_convert.pdf
## optimize "$nomefile"_convert.pdf and rename it with the ebook label
#gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$nomefile"_ebook.pdf "$nomefile"_convert.pdf
## clean up the intermediate files
#rm "$nomefile"_convert.pdf
#rm "$nomefile"_uniform.pdf
#rm pag_*

#######################################################################
# SECOND AND "DEFINITIVE" SOLUTION
# convert all the PDF pages to images and archive them as a CBZ
#
## Using Ghostscript (BEST RESULTS)
# list of the output devices available:
# http://pages.cs.wisc.edu/~ghost/doc/AFPL/devices.htm
#
## Good quality for raster graphics (jpeg --> JPEG format with RGB output, jpeggray --> greyscale):
#gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=jpeggray -r300x300 -dJPEGQ=100 -sOutputFile=pag_%02d.jpg "$nomefile"_uniform.pdf
## Best quality for vector graphics (png16m --> PNG format with 24-bit colour output, pnggray --> greyscale):
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pnggray -r300x300 -sOutputFile=pag_%02d.png "$nomefile"_uniform.pdf

## zip the images, rename the archive to cbz and clean up
#zip "$nomefile" pag_* # it doesn't work if the pages exceed the hundreds
zip "$nomefile" pag_* | sort -n -t _ -k 2 # it works better!
mv "$nomefile".zip "$nomefile".cbz
rm pag_*
## clean up the intermediate files
rm "$nomefile"_uniform.pdf

# EXIT
exit 0
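A typical session looks like this (the newspaper file name is hypothetical):

# make the script executable, then run it on a PDF
chmod +x pdf2cbz_with_uniform_pagination.sh
./pdf2cbz_with_uniform_pagination.sh quotidiano.pdf
# the result is quotidiano.cbz, ready to be copied to the eReader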