From PDF to CBZ for readability on eReaders

The "definitive" solution

The idea I had in mind was a double conversion:

uniform PDF --> images
images --> working PDF

So I started testing what I had found in forums, using both the ImageMagick package and Ghostscript itself (the latter gave clearly better quality while taking less time and fewer hardware resources). In the second step, however, the resulting PDF was hard to read: depending on the settings and formats chosen, the text came out either grainy or partially erased.


Original PDF. Size: 45.5 MB
PDF from images converted to JPG. Size: 11.6 MB
PDF from images converted to PNG. Size: 18.1 MB (it looks very good on screen, but on the reader the text loses definition)

I have nevertheless kept the best commands in the script, in the section:

#######################################################################
# FIRST SOLUTION (POOR OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#

I added suitable descriptions and left the commands commented out; if anyone wants to suggest corrections or improvements, don't hesitate!

I had almost decided to give up, left with a set of good-quality images that my reader could not open... when my brother indirectly gave me the idea that solved this annoying limitation: the Comic book archive. This format is nothing more than a zip-compressed file containing a series of images, renamed to CBZ.
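Since a CBZ is an ordinary zip under a different extension, it can normally be recognized by the zip signature: an ordinary zip archive begins with the two letters PK. The snippet below is only a minimal sketch of that check; `is_zip_container` is a hypothetical helper name and is not part of the script in this post.

```shell
#!/bin/bash
# is_zip_container FILE (hypothetical helper, not part of the original script)
# Succeeds when FILE starts with "PK", the signature that opens an ordinary zip archive.
is_zip_container() {
    local magic
    # read the first two bytes; drop NUL bytes so bash does not warn about them
    magic=$(head -c 2 "$1" 2>/dev/null | tr -d '\0')
    [ "$magic" = "PK" ]
}
```

For example, `is_zip_container comic.cbz && unzip -l comic.cbz` would list the pages of the archive, assuming `unzip` is installed; `comic.cbz` is a placeholder name.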

So the second solution takes care of:

extracting the images from the uniform PDF (I converted them to grayscale, but I have included the link to the other available outputs);
compressing them into a zip;
renaming the resulting file to CBZ.
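The three steps above can be sketched as one small function. This is an illustration under stated assumptions, not the listing from this post: `cbz_name` and `pdf_to_cbz` are hypothetical names, `gs` and `zip` are assumed to be installed, and the archive is written straight to a `.cbz` name (zip keeps an extension that is already there), which merges the compress and rename steps.

```shell
#!/bin/bash
# cbz_name FILE (hypothetical helper): swap the last extension for .cbz
cbz_name() {
    printf '%s.cbz\n' "${1%.*}"
}

# pdf_to_cbz FILE.pdf (hypothetical sketch; assumes gs and zip are installed)
pdf_to_cbz() {
    local pdf=$1 workdir status=0
    workdir=$(mktemp -d) || return 1
    # 1. extract one grayscale PNG per page into a temporary directory
    # 2. and 3. zip the images directly into the .cbz file;
    #    -j stores them without the temporary directory path
    gs -dBATCH -dNOPAUSE -dSAFER -dQUIET -sDEVICE=pnggray -r300 \
       -sOutputFile="$workdir/pag_%04d.png" "$pdf" &&
    zip -q -j "$(cbz_name "$pdf")" "$workdir"/pag_*.png || status=1
    # remove the temporary images whether or not the conversion succeeded
    rm -r "$workdir"
    return "$status"
}
```

Calling `pdf_to_cbz newspaper_uniform.pdf` would, under these assumptions, leave `newspaper_uniform.cbz` next to the input; the file name is a placeholder.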

It is no longer a PDF, but it works, and at last I can even read the Sunday satirical supplement!

Conversion to CBZ from PNG
Size: 21.7 MB
(I chose grayscale output because the reader is black and white!)

Here is the script pdf2cbz_with_uniform_pagination.sh, and for the laziest here is the listing:

#!/bin/bash
# Script written by Nicola Rainiero
# Available at http://rainnic.altervista.org
#
# This work is licensed under the Creative Commons Attribution 3.0 Italy License.
# To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/it/
#
# Requirements: pdfinfo, awk, bc, LaTeX with pdfpages and ifthen packages, imagemagick, ghostscript, zip
# Usage: pdf2cbz_with_uniform-pagination.sh INPUT_FILE.pdf
#
# If you don't have the permission to execute it, run in a terminal:
# chmod +x pdf2cbz_with_uniform-pagination.sh
#

if [ -n "$1" ]
then
	document=$1 # the input PDF file is the first argument
else
	echo "Missing input PDF file!" >&2
	exit 1 # non-zero status: the script could not run
fi

echo $document
# read the exact number of pages in the PDF file and store it in the "pagine" variable
pagine=$(pdfinfo "$document" | awk '$1=="Pages:" {print $2}')

echo "The pages of this document are $pagine"
echo '% File di conversione' > latex.tex
# initialize the latex document: the default page layout is "portrait";
# to turn every page of the document to "landscape", replace "portrait" in the two lines below
echo '\documentclass[a4paper,portrait]{minimal}' >> latex.tex;
echo '\usepackage[pdftex,portrait]{geometry}' >> latex.tex;
echo '\usepackage{pdfpages}' >> latex.tex;
echo '\usepackage{ifthen}' >> latex.tex;
echo '\newcounter{pg}' >> latex.tex;
echo '\begin{document}' >> latex.tex;

# read the horizontal dimension of the first page ("-f 1" option) and save it in: rifh
echo `pdfinfo -f 1 -box $document | awk ' $1=="MediaBox:" {print $4}'` > input.txt
readH=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')
echo "The width of the first page is $readH pt"
rifh=$( echo "($readH+0.5)/1" | bc )
echo "(Round it to $rifh)"
# read the vertical dimension of the first page ("-f 1" option) and save it in: rifv
echo `pdfinfo -f 1 -box $document | awk ' $1=="MediaBox:" {print $5}'` > input.txt
readV=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')
echo "The height of the first page is $readV pt"
rifv=$( echo "($readV+0.5)/1" | bc )
echo "(Round it to $rifv)"
echo "----------------"

# check for every page the corresponding horizontal dimension
# and compare it with the "rifh" variable
for i in `seq 1 $pagine`
do
   #
   # echo `pdfinfo -f $i -box $document | awk ' $1=="MediaBox:" {print $4}'` > input.txt
   # removed because it gives the following error:
   # "Command Line Error: Wrong page range given: the first page ("selected page") can not
   # be after the last page (previous page)"
   #
   # The new command works better:
   echo `pdfinfo -l $pagine -box $document | awk '$2=='"$i"' && $3=="MediaBox:" {print $6}'` > input.txt
   h=$(cat input.txt | awk '{ SUM += $1} END { print SUM }')
    echo "For the page $i the width is $h pt"
    h=$( echo "($h+0.5)/1" | bc )
    echo "(Round it to $h)"
    if [[ "$h" -gt "$rifh+200" ]]
	then
		echo 'split this' page
		echo '   \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex;
		echo '   \includepdf[pages='$i',viewport='$rifh' 0 '$h' '$rifv']{'$document'} ' >> latex.tex;
	else
		echo 'do' not 'split this' page
		echo '   \includepdf[pages='$i',viewport=0 0 '$rifh' '$rifv']{'$document'} ' >> latex.tex;
	fi
done

# close the latex document and make pdf --> latex.pdf
echo '\end{document} ' >> latex.tex;
pdflatex latex.tex 

# save in the "nomefile" variable the input file name cut at its first dot (extension dropped)
nomefile=${1%%.*}
echo $nomefile

# rename the output pdf file
mv latex.pdf "$nomefile"_uniform.pdf

# clean up the leftover LaTeX files
rm input.txt
rm latex*

#######################################################################
# FIRST SOLUTION (PROBLEM WITH TEXT AND OVERALL QUALITY)
# convert all the PDF pages to various image formats and convert again to PDF
#

# Using Imagemagick JPEG/PNG/TIFF
# compression level and quality:
# http://www.imagemagick.org/script/command-line-options.php#quality
#

# Stage 1: PDF to images
#
## Poor quality for vector graphics
#convert *.png "$nomefile"_convert.pdf
#convert -density 288 "$nomefile"_uniform.pdf -resize 25% pag_%02d.png
## Good compromise quality/weight of the images
#convert -density 200x200 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.png
## Good compromise quality/weight of the images, massive use of RAM memory during conversion
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.tiff
##Only slim files, but poor quality:
#convert -density 300x300 "$nomefile"_uniform.pdf -units PixelsPerInch pag_%02d.jpg

# Stage 2: images to PDF
#
#convert *.png -quality 90 -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.tiff -set units PixelsPerInch "$nomefile"_convert.pdf
#convert *.jpg -quality 75 -set units PixelsPerInch "$nomefile"_convert.pdf

##optimize "$nomefile"_convert.pdf and rename it in "nomefile" plus the ebook label
#gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$nomefile"_ebook.pdf "$nomefile"_convert.pdf

##clean up the leftover files
#rm "$nomefile"_convert.pdf
#rm "$nomefile"_uniform.pdf
#rm pag_*

#######################################################################
# SECOND AND "DEFINITIVE" SOLUTION
# convert all the PDF pages to images and pack them into a CBZ archive
#

##Using Ghostscript (BEST RESULTS)
# list of output devices available:
# http://pages.cs.wisc.edu/~ghost/doc/AFPL/devices.htm
#
##Good quality for raster graphics (jpeggray --> grayscale JPEG; use jpeg for RGB output):
#gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=jpeggray -r300x300 -dJPEGQ=100 -sOutputFile=pag_%02d.jpg "$nomefile"_uniform.pdf
##Best quality for vector graphics (pnggray --> grayscale PNG; use png16m for 24-bit color output):
gs -dBATCH -dNOPAUSE -dSAFER -sDEVICE=pnggray -r300x300 -sOutputFile=pag_%02d.png "$nomefile"_uniform.pdf

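##zip, rename to cbz and clean images
#zip "$nomefile" pag_* # it doesn't work if the pages exceed the hundreds (the glob lists pag_100 before pag_11)
# pass the names to zip in numeric page order ("-@" makes zip read the file list from stdin)
printf '%s\n' pag_* | sort -n -t _ -k 2 | zip "$nomefile" -@
mv  "$nomefile".zip "$nomefile".cbz
rm pag_*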

##clean up the leftover files
rm "$nomefile"_uniform.pdf

# EXIT
exit 0
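
The script numbers the page images with `pag_%02d`, so from page 100 onwards the names no longer have the same width, and plain alphabetical order (the order a shell glob uses) puts `pag_100` before `pag_11`. One possible alternative, sketched here as an assumption rather than as the script's own method, is to pick the zero padding from the page count so that every name has the same length. `page_pattern` is a hypothetical helper name.

```shell
#!/bin/bash
# page_pattern PAGES (hypothetical helper, not part of the original script)
# Prints a Ghostscript -sOutputFile pattern whose zero padding matches
# the number of digits in PAGES, e.g. 250 pages -> pag_%03d.png
page_pattern() {
    local pages=$1
    # ${#pages} is the length of the string, i.e. the number of digits
    printf 'pag_%%0%dd.png\n' "${#pages}"
}
```

It could be used as `-sOutputFile="$(page_pattern "$total")"` in the `gs` call, where `total` is a placeholder for the page count of the uniform PDF; that count can be higher than the original one, because wide pages are split in two.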
