How to reduce the size of merged PDF/A-1b files with PDFBox or another Java library



























Input: A list of (e.g. 14) PDF/A-1b files with embedded fonts.
Processing: A simple merge with Apache PDFBox.
Result: One PDF/A-1b file that is too large. (Its size is almost the sum of the sizes of all the source files.)

Question: Is there a way to reduce the file size of the resulting PDF?
Idea: Remove redundant embedded fonts. But how? And is that the right approach?

Unfortunately, the following code does not do the job, but it highlights the obvious problem.



    try (PDDocument document = PDDocument.load(new File("E:/tmp/16189_ZU_20181121195111_5544_2008-12-31_Standardauswertung.pdf"))) {
        List<COSName> collectedFonts = new ArrayList<>();
        PDPageTree pages = document.getDocumentCatalog().getPages();
        int pageNr = 0;
        for (PDPage page : pages) {
            pageNr++;
            Iterable<COSName> names = page.getResources().getFontNames();
            System.out.println("Page " + pageNr);
            for (COSName name : names) {
                collectedFonts.add(name);
                System.out.print("\t" + name + " - ");
                PDFont font = page.getResources().getFont(name);
                System.out.println(font + ", embedded: " + font.isEmbedded());
                page.getCOSObject().removeItem(COSName.F);
                page.getResources().getCOSObject().removeItem(name);
            }
        }
        document.save("E:/tmp/output.pdf");
    }


The code produces output like this:



Page 1
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 2
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 3
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 4
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 5
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 6
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 7
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 8
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 9
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 10
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 11
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 12
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 13
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 14
COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true


Any help appreciated ...



































  • Are the fonts embedded fully? Or as subsets?

    – mkl
    Nov 22 '18 at 6:58











  • @mkl from the output it looks as if they're fully embedded. So if the files are all from the same source, and have the same dictionary, then one could really replace the objects in the fonts resources.

    – Tilman Hausherr
    Nov 22 '18 at 7:28











  • After merging make sure to check the result file with preflight to be sure it is still PDF/A. I remember I had a problem years ago involving multiple output intents.

    – Tilman Hausherr
    Nov 22 '18 at 7:30











  • @mkl like Tilman guessed, I am almost sure that they are embedded fully

    – hab
    Nov 22 '18 at 8:37






  • Ok, the fonts indeed are completely embedded. And identically. Files like that can be optimized without too much effort. I'll try and find some time for a working answer.

    – mkl
    Nov 22 '18 at 11:16























java pdf fonts pdfbox filesize














asked Nov 21 '18 at 20:55 by hab
edited Nov 30 '18 at 9:49 by mkl

























3 Answers
































While debugging the file, I noticed that the font files for the same fonts were referenced several times. By replacing the font file item in each font descriptor dictionary with the font file item already seen for that font name, the duplicate font streams are no longer referenced, so they do not end up in the saved file. This way I was able to shrink a 30 MB file to around 6 MB.



    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;

    File file = new File("test.pdf");

    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument doc = PDDocument.load(file)) {
        // First font file item seen for each font name
        Map<String, COSBase> fontFileCache = new HashMap<>();
        for (int pageNumber = 0; pageNumber < doc.getNumberOfPages(); pageNumber++) {
            final PDPage page = doc.getPage(pageNumber);
            if (page.getResources() == null) {
                continue; // page without resources
            }
            COSDictionary pageDictionary = (COSDictionary) page.getResources().getCOSObject().getDictionaryObject(COSName.FONT);
            if (pageDictionary == null) {
                continue; // page without fonts
            }
            for (COSName currentFont : pageDictionary.keySet()) {
                COSDictionary fontDictionary = (COSDictionary) pageDictionary.getDictionaryObject(currentFont);
                for (COSName actualFont : fontDictionary.keySet()) {
                    COSBase actualFontDictionaryObject = fontDictionary.getDictionaryObject(actualFont);
                    if (actualFontDictionaryObject instanceof COSDictionary) {
                        // In practice this matches the font descriptor, which carries /FontName
                        COSDictionary fontFile = (COSDictionary) actualFontDictionaryObject;
                        if (fontFile.getItem(COSName.FONT_NAME) instanceof COSName) {
                            COSName fontName = (COSName) fontFile.getItem(COSName.FONT_NAME);
                            fontFileCache.computeIfAbsent(fontName.getName(), key -> fontFile.getItem(COSName.FONT_FILE2));
                            fontFile.setItem(COSName.FONT_FILE2, fontFileCache.get(fontName.getName()));
                        }
                    }
                }
            }
        }
        // Save directly to the target file instead of buffering in memory
        doc.save(new File("test_compressed.pdf"));
    }


Maybe this is not the most elegant way to do that, but it works and keeps the PDF/A-1b compatibility.
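The caching idea can also be tried out without PDFBox. The following is only a toy sketch under stated assumptions: `FontDescriptorModel` and `FontShareSketch` are hypothetical names that are not part of PDFBox, and a plain byte array stands in for the font file stream. Unlike the code above, the sketch shares a font file only when the bytes are actually identical, not merely when the font names match.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a font descriptor: a name plus a reference to the font program. */
final class FontDescriptorModel {
    final String fontName;
    byte[] fontFile; // stands in for the font file stream

    FontDescriptorModel(String fontName, byte[] fontFile) {
        this.fontName = fontName;
        this.fontFile = fontFile;
    }
}

final class FontShareSketch {
    /**
     * Points descriptors with the same name and identical bytes at one shared array.
     * Returns the number of descriptors that were redirected.
     */
    static int shareFontFiles(List<FontDescriptorModel> descriptors) {
        Map<String, byte[]> firstSeen = new HashMap<>();
        int redirected = 0;
        for (FontDescriptorModel d : descriptors) {
            // putIfAbsent returns the earlier array for this name, or null if there was none
            byte[] cached = firstSeen.putIfAbsent(d.fontName, d.fontFile);
            if (cached != null && cached != d.fontFile && Arrays.equals(cached, d.fontFile)) {
                d.fontFile = cached;
                redirected++;
            }
        }
        return redirected;
    }

    /** Counts the distinct font file instances (by identity) that are still referenced. */
    static int distinctFontFiles(List<FontDescriptorModel> descriptors) {
        Map<byte[], Boolean> seen = new IdentityHashMap<>();
        for (FontDescriptorModel d : descriptors) {
            seen.put(d.fontFile, Boolean.TRUE);
        }
        return seen.size();
    }
}
```

The byte comparison costs some time, but it avoids replacing a font program with a different one that happens to carry the same name.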






























  • This only works if (a) all font programs embedded for the same name indeed are identical, and (b) all fonts to consider are in the immediate page resources, not the resources of some referred-to xobject or pattern. If those conditions are fulfilled, though, it very likely is much faster than the approach in my answer.

    – mkl
    Nov 30 '18 at 6:03






  • I just compared the results for the example PDF you shared. Original PDF size: 6561805. Result size, your code: 788470. Result size, my code: 691147. Thus, even though there still is some more optimization potential, your shorter and faster code does remove the major duplicates, too.

    – mkl
    Nov 30 '18 at 9:33











  • Thanks for your investigations! Is the file with your code still PDF/A-1b compatible?

    – schowave
    Nov 30 '18 at 9:40













  • "Is the file with your code still PDF/A-1b compatible?" - in case of your example file Adobe Acrobat 9.5 Preflight says it is.

    – mkl
    Nov 30 '18 at 9:55











  • Your solution worked for me, thank you @schowave

    – hab
    Dec 5 '18 at 19:58

































The code in this answer is an attempt to optimize documents like the OP's example document, i.e. documents containing copies of exactly identical objects, in the case at hand completely identical, fully embedded fonts. It does not merge merely nearly identical objects, e.g. multiple subsets of the same font into one single union subset.



In the course of the comments on the question it became clear that the duplicate fonts in the OP's PDF indeed were identical full copies of a source font file. To merge such duplicate objects, one has to collect the complex objects (arrays, dictionaries, streams) of a document, compare them with each other, and then merge duplicates.



As actual pairwise comparison of all complex objects of a document can take too much time in case of large documents, the following code calculates a hash of these objects and only compares objects with identical hash.
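The hash-then-compare idea can be illustrated in isolation. This is a minimal sketch, not the code of this answer: the class name `HashBucketSketch` is made up, byte arrays stand in for PDF objects, and a `HashMap` of buckets replaces the sorted list of hashes used further down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashBucketSketch {
    /**
     * Groups equal byte arrays. The expensive exact comparison is only
     * performed between objects that share the same hash value.
     */
    static List<List<byte[]>> groupDuplicates(List<byte[]> objects) {
        Map<Integer, List<List<byte[]>>> buckets = new HashMap<>();
        List<List<byte[]>> groups = new ArrayList<>();
        for (byte[] object : objects) {
            List<List<byte[]>> candidates =
                    buckets.computeIfAbsent(Arrays.hashCode(object), h -> new ArrayList<>());
            List<byte[]> match = null;
            for (List<byte[]> group : candidates) {
                // Equal hashes do not prove equality, so compare the content too
                if (Arrays.equals(group.get(0), object)) {
                    match = group;
                    break;
                }
            }
            if (match == null) {
                match = new ArrayList<>();
                candidates.add(match);
                groups.add(match);
            }
            match.add(object);
        }
        return groups;
    }
}
```

Every group with more than one member is a set of duplicates; one member is kept and the others can be dropped.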



To merge duplicates, the code selects one of the duplicates and replaces all references to any of the other duplicates with a reference to the chosen one, removing the other duplicates from the document object pool. To do this more effectively, the code initially not only collects all complex objects but also all references to each of them.
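Why collecting the references up front pays off can be shown with a toy index. The sketch below is an illustration only; `ReferenceIndexSketch` and `Slot` are hypothetical names, with a `Slot` standing in for an array element or dictionary entry that points at an object.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class ReferenceIndexSketch {
    /** A mutable location that points at some object. */
    static final class Slot {
        Object target;

        Slot(Object target) {
            this.target = target;
        }
    }

    /** Builds the index "object -> slots pointing at it", based on object identity. */
    static Map<Object, List<Slot>> index(List<Slot> slots) {
        Map<Object, List<Slot>> incoming = new IdentityHashMap<>();
        for (Slot slot : slots) {
            incoming.computeIfAbsent(slot.target, k -> new ArrayList<>()).add(slot);
        }
        return incoming;
    }

    /**
     * Redirects every slot pointing at the duplicate to the survivor and
     * drops the duplicate from the index. Returns the number of slots moved.
     */
    static int redirect(Map<Object, List<Slot>> incoming, Object duplicate, Object survivor) {
        List<Slot> moved = incoming.remove(duplicate);
        if (moved == null) {
            return 0;
        }
        for (Slot slot : moved) {
            slot.target = survivor;
        }
        incoming.computeIfAbsent(survivor, k -> new ArrayList<>()).addAll(moved);
        return moved.size();
    }
}
```

With such an index, redirecting a duplicate touches only the slots that actually point at it, instead of scanning the whole object graph again for every merge.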



The optimization code



This is the method to call to optimize a PDDocument:



    public void optimize(PDDocument pdDocument) throws IOException {
        Map<COSBase, Collection<Reference>> complexObjects = findComplexObjects(pdDocument);
        for (int pass = 0; ; pass++) {
            int merges = mergeDuplicates(complexObjects);
            if (merges <= 0) {
                System.out.printf("Pass %d - No merged objects\n\n", pass);
                break;
            }
            System.out.printf("Pass %d - Merged objects: %d\n\n", pass, merges);
        }
    }


(OptimizeAfterMerge method under test)



The optimization takes multiple passes as the equality of some objects can only be recognized after duplicates they reference have been merged.
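A toy example may make the need for several passes more tangible. It is a sketch under simplifying assumptions, not part of the answer's code: `MultiPassSketch` and `Node` are made-up names, and two nodes count as equal only if their labels match and their children are the very same instances.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class MultiPassSketch {
    /** Toy object with a label and references to child objects. */
    static final class Node {
        final String label;
        final Node[] children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = children;
        }
    }

    /** Equal labels and identical child instances, like references that are not merged yet. */
    static boolean same(Node a, Node b) {
        if (!a.label.equals(b.label) || a.children.length != b.children.length) {
            return false;
        }
        for (int i = 0; i < a.children.length; i++) {
            if (a.children[i] != b.children[i]) {
                return false;
            }
        }
        return true;
    }

    /** One pass: returns the number of duplicates removed from the live list. */
    static int mergeOnce(List<Node> live) {
        // Phase 1: pick survivors using the graph as it is at the start of the pass
        Map<Node, Node> survivorOf = new IdentityHashMap<>();
        for (int i = 0; i < live.size(); i++) {
            Node a = live.get(i);
            if (survivorOf.containsKey(a)) {
                continue;
            }
            for (int j = i + 1; j < live.size(); j++) {
                Node b = live.get(j);
                if (!survivorOf.containsKey(b) && same(a, b)) {
                    survivorOf.put(b, a);
                }
            }
        }
        // Phase 2: drop the duplicates and redirect child references to the survivors
        live.removeIf(survivorOf::containsKey);
        for (Node n : live) {
            for (int k = 0; k < n.children.length; k++) {
                n.children[k] = survivorOf.getOrDefault(n.children[k], n.children[k]);
            }
        }
        return survivorOf.size();
    }

    /** Repeats passes until nothing is merged; returns the number of passes that merged something. */
    static int optimize(List<Node> live) {
        int passes = 0;
        while (mergeOnce(live) > 0) {
            passes++;
        }
        return passes;
    }
}
```

Given two identical leaf nodes and two parents that each reference one of them, the first pass can only merge the leaves. The parents become equal only after that merge, so a second pass is needed to merge them as well.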



The following helper methods and classes collect the complex objects of a PDF and the references to each of them:



    Map<COSBase, Collection<Reference>> findComplexObjects(PDDocument pdDocument) {
        COSDictionary catalogDictionary = pdDocument.getDocumentCatalog().getCOSObject();
        Map<COSBase, Collection<Reference>> incomingReferences = new HashMap<>();
        incomingReferences.put(catalogDictionary, new ArrayList<>());

        Set<COSBase> lastPass = Collections.<COSBase>singleton(catalogDictionary);
        Set<COSBase> thisPass = new HashSet<>();
        while(!lastPass.isEmpty()) {
            for (COSBase object : lastPass) {
                if (object instanceof COSArray) {
                    COSArray array = (COSArray) object;
                    for (int i = 0; i < array.size(); i++) {
                        addTarget(new ArrayReference(array, i), incomingReferences, thisPass);
                    }
                } else if (object instanceof COSDictionary) {
                    COSDictionary dictionary = (COSDictionary) object;
                    for (COSName key : dictionary.keySet()) {
                        addTarget(new DictionaryReference(dictionary, key), incomingReferences, thisPass);
                    }
                }
            }
            lastPass = thisPass;
            thisPass = new HashSet<>();
        }
        return incomingReferences;
    }

    void addTarget(Reference reference, Map<COSBase, Collection<Reference>> incomingReferences, Set<COSBase> thisPass) {
        COSBase object = reference.getTo();
        if (object instanceof COSArray || object instanceof COSDictionary) {
            Collection<Reference> incoming = incomingReferences.get(object);
            if (incoming == null) {
                incoming = new ArrayList<>();
                incomingReferences.put(object, incoming);
                thisPass.add(object);
            }
            incoming.add(reference);
        }
    }


(OptimizeAfterMerge helper methods findComplexObjects and addTarget)



    interface Reference {
        public COSBase getFrom();

        public COSBase getTo();
        public void setTo(COSBase to);
    }

    static class ArrayReference implements Reference {
        public ArrayReference(COSArray array, int index) {
            this.from = array;
            this.index = index;
        }

        @Override
        public COSBase getFrom() {
            return from;
        }

        @Override
        public COSBase getTo() {
            return resolve(from.get(index));
        }

        @Override
        public void setTo(COSBase to) {
            from.set(index, to);
        }

        final COSArray from;
        final int index;
    }

    static class DictionaryReference implements Reference {
        public DictionaryReference(COSDictionary dictionary, COSName key) {
            this.from = dictionary;
            this.key = key;
        }

        @Override
        public COSBase getFrom() {
            return from;
        }

        @Override
        public COSBase getTo() {
            return resolve(from.getDictionaryObject(key));
        }

        @Override
        public void setTo(COSBase to) {
            from.setItem(key, to);
        }

        final COSDictionary from;
        final COSName key;
    }


(OptimizeAfterMerge helper interface Reference with implementations ArrayReference and DictionaryReference)



And the following helper methods and classes finally identify and merge duplicates:



    int mergeDuplicates(Map<COSBase, Collection<Reference>> complexObjects) throws IOException {
        List<HashOfCOSBase> hashes = new ArrayList<>(complexObjects.size());
        for (COSBase object : complexObjects.keySet()) {
            hashes.add(new HashOfCOSBase(object));
        }
        Collections.sort(hashes);

        int removedDuplicates = 0;
        if (!hashes.isEmpty()) {
            int runStart = 0;
            int runHash = hashes.get(0).hash;
            for (int i = 1; i < hashes.size(); i++) {
                int hash = hashes.get(i).hash;
                if (hash != runHash) {
                    int runSize = i - runStart;
                    if (runSize != 1) {
                        System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
                        removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, i));
                    }
                    runHash = hash;
                    runStart = i;
                }
            }
            int runSize = hashes.size() - runStart;
            if (runSize != 1) {
                System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
                removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, hashes.size()));
            }
        }
        return removedDuplicates;
    }

    int mergeRun(Map<COSBase, Collection<Reference>> complexObjects, List<HashOfCOSBase> run) {
        int removedDuplicates = 0;

        List<List<COSBase>> duplicateSets = new ArrayList<>();
        for (HashOfCOSBase entry : run) {
            COSBase element = entry.object;
            for (List<COSBase> duplicateSet : duplicateSets) {
                if (equals(element, duplicateSet.get(0))) {
                    duplicateSet.add(element);
                    element = null;
                    break;
                }
            }
            if (element != null) {
                List<COSBase> duplicateSet = new ArrayList<>();
                duplicateSet.add(element);
                duplicateSets.add(duplicateSet);
            }
        }

        System.out.printf("Identified %d set(s) of identical objects in run.\n", duplicateSets.size());

        for (List<COSBase> duplicateSet : duplicateSets) {
            if (duplicateSet.size() > 1) {
                COSBase surviver = duplicateSet.remove(0);
                Collection<Reference> surviverReferences = complexObjects.get(surviver);
                for (COSBase object : duplicateSet) {
                    Collection<Reference> references = complexObjects.get(object);
                    for (Reference reference : references) {
                        reference.setTo(surviver);
                        surviverReferences.add(reference);
                    }
                    complexObjects.remove(object);
                    removedDuplicates++;
                }
                surviver.setDirect(false);
            }
        }

        return removedDuplicates;
    }

    boolean equals(COSBase a, COSBase b) {
        if (a instanceof COSArray) {
            if (b instanceof COSArray) {
                COSArray aArray = (COSArray) a;
                COSArray bArray = (COSArray) b;
                if (aArray.size() == bArray.size()) {
                    for (int i=0; i < aArray.size(); i++) {
                        if (!resolve(aArray.get(i)).equals(resolve(bArray.get(i))))
                            return false;
                    }
                    return true;
                }
            }
        } else if (a instanceof COSDictionary) {
            if (b instanceof COSDictionary) {
                COSDictionary aDict = (COSDictionary) a;
                COSDictionary bDict = (COSDictionary) b;
                Set<COSName> keys = aDict.keySet();
                if (keys.equals(bDict.keySet())) {
                    for (COSName key : keys) {
                        // Resolve both sides, as in the array branch above
                        if (!resolve(aDict.getItem(key)).equals(resolve(bDict.getItem(key))))
                            return false;
                    }
                    // In case of COSStreams we strictly speaking should
                    // also compare the stream contents here. But apparently
                    // their hashes coincide well enough for the original
                    // hashing equality, so let's just assume...
                    return true;
                }
            }
        }
        return false;
    }

    static COSBase resolve(COSBase object) {
        while (object instanceof COSObject)
            object = ((COSObject)object).getObject();
        return object;
    }


(OptimizeAfterMerge helper methods mergeDuplicates, mergeRun, equals, and resolve)
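The "sort by hash, then compare only inside runs of equal hashes" idea can be tried without PDFBox. The following is only a sketch under assumptions: the class `HashRunDemo` and its methods are hypothetical names, and plain byte arrays stand in for the PDF objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone demo; byte arrays stand in for PDF objects.
class HashRunDemo {

    /** Returns the sets of identical arrays: sort by hash, then compare only inside equal-hash runs. */
    static List<List<byte[]>> findDuplicateSets(List<byte[]> objects) {
        List<byte[]> sorted = new ArrayList<>(objects);
        sorted.sort(Comparator.comparingInt((byte[] a) -> Arrays.hashCode(a)));

        List<List<byte[]>> result = new ArrayList<>();
        int runStart = 0;
        for (int i = 1; i <= sorted.size(); i++) {
            boolean runEnds = i == sorted.size()
                    || Arrays.hashCode(sorted.get(i)) != Arrays.hashCode(sorted.get(runStart));
            if (runEnds) {
                if (i - runStart > 1) {
                    // Only runs with at least two elements can contain duplicates.
                    result.addAll(splitRun(sorted.subList(runStart, i)));
                }
                runStart = i;
            }
        }
        return result;
    }

    /** Equal hashes do not prove equality, so the contents are verified within the run. */
    static List<List<byte[]>> splitRun(List<byte[]> run) {
        List<List<byte[]>> sets = new ArrayList<>();
        for (byte[] candidate : run) {
            List<byte[]> match = null;
            for (List<byte[]> set : sets) {
                if (Arrays.equals(candidate, set.get(0))) {
                    match = set;
                    break;
                }
            }
            if (match == null) {
                match = new ArrayList<>();
                sets.add(match);
            }
            match.add(candidate);
        }
        sets.removeIf(set -> set.size() < 2); // drop hash collisions that turned out to be unique
        return sets;
    }
}
```

With this shape, the expensive content comparison only happens among candidates that already share a hash, which is the reason the answer sorts the hashes first.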



    // Needs java.io.InputStream, java.security.MessageDigest,
    // java.security.NoSuchAlgorithmException, and java.util.Arrays.
    static class HashOfCOSBase implements Comparable<HashOfCOSBase> {
        public HashOfCOSBase(COSBase object) throws IOException {
            this.object = object;
            this.hash = calculateHash(object);
        }

        int calculateHash(COSBase object) throws IOException {
            if (object instanceof COSArray) {
                int result = 1;
                for (COSBase member : (COSArray) object)
                    result = 31 * result + member.hashCode();
                return result;
            } else if (object instanceof COSDictionary) {
                int result = 3;
                for (Map.Entry<COSName, COSBase> entry : ((COSDictionary) object).entrySet())
                    result += entry.hashCode();
                if (object instanceof COSStream) {
                    // For streams, also fold a digest of the raw (still encoded) stream bytes into the hash.
                    try (InputStream data = ((COSStream) object).createRawInputStream()) {
                        MessageDigest md = MessageDigest.getInstance("MD5");
                        byte[] buffer = new byte[8192];
                        int bytesRead = 0;
                        while ((bytesRead = data.read(buffer)) >= 0)
                            md.update(buffer, 0, bytesRead);
                        result = 31 * result + Arrays.hashCode(md.digest());
                    } catch (NoSuchAlgorithmException e) {
                        throw new IOException(e);
                    }
                }
                return result;
            } else {
                throw new IllegalArgumentException(String.format("Unknown complex COSBase type %s", object.getClass().getName()));
            }
        }

        final COSBase object;
        final int hash;

        @Override
        public int compareTo(HashOfCOSBase o) {
            // Order by the calculated hash; break ties arbitrarily but consistently.
            int result = Integer.compare(hash, o.hash);
            if (result == 0)
                result = Integer.compare(hashCode(), o.hashCode());
            return result;
        }
    }


(OptimizeAfterMerge helper class HashOfCOSBase)
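The stream part of that hash can be looked at in isolation. This is a minimal sketch, assuming a hypothetical class `StreamHashDemo` and using in-memory byte arrays instead of PDF stream objects; it only shows that identical content yields the same value regardless of how the data is chunked while reading.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical stand-alone demo of the digest-based content hash.
class StreamHashDemo {

    /** Folds an MD5 digest of the stream bytes into an int, reading in 8 KB chunks. */
    static int hashOf(InputStream data) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = data.read(buffer)) >= 0) {
                md.update(buffer, 0, bytesRead);
            }
            return Arrays.hashCode(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }

    /** Convenience overload for in-memory content. */
    static int hashOf(byte[] content) {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return hashOf(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that equal values only make two streams candidates for merging; different content can in principle produce the same int, which is why a comparison step still follows the hashing.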



Applying the code to the OP's example document



The OP's example document is about 6.5 MB in size. Applying the above code like this



    try (PDDocument pdDocument = PDDocument.load(SOURCE)) {
        optimize(pdDocument);
        pdDocument.save(RESULT);
    }


results in a PDF less than 700 KB in size, and it appears to be complete.



(If something is missing, please tell me and I'll try to fix it.)



Words of warning



On the one hand, this optimizer will not recognize all identical duplicates. In particular, duplicate cycles of objects that reference each other are not recognized: the code treats two objects as duplicates only if their contents, including the objects they reference, are identical, and in duplicate cycles the referenced objects are themselves distinct, not yet merged copies.



On the other hand, this optimizer might already be too eager in some cases, because some duplicates may need to remain separate objects for PDF viewers to treat each instance as an individual entity.



Furthermore, this program touches all kinds of objects in the file, even those defining the inner structures of the PDF, but it does not attempt to update any PDFBox classes managing this structure (PDDocument, PDDocumentCatalog, PDAcroForm, ...). To keep pending changes from corrupting the whole document, please apply this program only to freshly loaded, unmodified PDDocument instances and save them right afterwards.
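To illustrate the first warning, here is a small sketch without PDFBox. Everything in it is an assumption made for illustration: `MergePassDemo` and `Node` are hypothetical names, and a node with a label and one reference stands in for a PDF dictionary. Duplicates in a chain get merged bottom-up over repeated sweeps, while two duplicate cycles never become equal because each node always references a different, unmerged copy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo of repeated merge sweeps.
class MergePassDemo {

    /** Minimal stand-in for a PDF dictionary: a label plus one optional reference. */
    static final class Node {
        final String label;
        Node ref;
        Node(String label, Node ref) { this.label = label; this.ref = ref; }
    }

    /** Shallow equality: same label and the very same referenced object. */
    static boolean shallowEquals(Node a, Node b) {
        return a.label.equals(b.label) && a.ref == b.ref;
    }

    /** Sweeps over the nodes until nothing changes; returns the number of merged nodes. */
    static int optimize(List<Node> nodes) {
        int total = 0;
        boolean merged = true;
        while (merged) {
            merged = false;
            for (int i = 0; i < nodes.size(); i++) {
                for (int j = nodes.size() - 1; j > i; j--) {
                    Node survivor = nodes.get(i);
                    Node duplicate = nodes.get(j);
                    if (shallowEquals(survivor, duplicate)) {
                        // Redirect all references from the duplicate to the survivor.
                        for (Node n : nodes) {
                            if (n.ref == duplicate) n.ref = survivor;
                        }
                        nodes.remove(j);
                        total++;
                        merged = true;
                    }
                }
            }
        }
        return total;
    }

    /** Two copies of "font -> file": the files merge first, then the fonts become equal. */
    static int mergedInChain() {
        Node file1 = new Node("file", null), file2 = new Node("file", null);
        Node font1 = new Node("font", file1), font2 = new Node("font", file2);
        List<Node> nodes = new ArrayList<>(List.of(font1, font2, file1, file2));
        return optimize(nodes);
    }

    /** Two copies of the cycle "a -> b -> a": no pair ever references the same object. */
    static int mergedInCycle() {
        Node a1 = new Node("a", null), b1 = new Node("b", a1);
        a1.ref = b1;
        Node a2 = new Node("a", null), b2 = new Node("b", a2);
        a2.ref = b2;
        List<Node> nodes = new ArrayList<>(List.of(a1, b1, a2, b2));
        return optimize(nodes);
    }
}
```

In this toy model `mergedInChain()` merges both duplicates, whereas `mergedInCycle()` merges nothing, which mirrors the limitation described above.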








  • Thank you, this code seems to do the job, too. As I was forced to provide a solution very quickly last week, I chose to adopt schowave's solution, as his few-lines approach worked for me and my PDFs. BTW: in the meantime I found a working solution using iText 7; I'm going to post a code example tomorrow.

    – hab
    Dec 5 '18 at 20:14




















Another way I found is to use iText 7 with smart mode enabled (pdfWriter.setSmartMode):



    // out, colorProfile, pdfs, createPageList and BieneException are defined elsewhere in my code.
    try (PdfWriter pdfWriter = new PdfWriter(out)) {
        pdfWriter.setSmartMode(true); // The optimization happens here, e.g. redundantly embedded fonts are reduced
        pdfWriter.setCompressionLevel(Deflater.BEST_COMPRESSION);
        try (PdfDocument pdfDoc = new PdfADocument(pdfWriter, PdfAConformanceLevel.PDF_A_1B,
                new PdfOutputIntent("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", colorProfile))) {
            PdfMerger merger = new PdfMerger(pdfDoc);
            merger.setCloseSourceDocuments(true);
            try {
                for (InputStream pdf : pdfs) {
                    try (PdfDocument doc = new PdfDocument(new PdfReader(pdf))) {
                        merger.merge(doc, createPageList(doc.getNumberOfPages()));
                    }
                }
                merger.close();
            }
            catch (com.itextpdf.kernel.crypto.BadPasswordException e) {
                throw new BieneException("Concatenating a password-protected PDF document is not possible: " + e.getMessage(),
                        e);
            }
            catch (com.itextpdf.io.IOException | PdfException e) {
                throw new BieneException(e.getMessage(), e);
            }
        }
    }





    4














    The code in this answer is an attempt to optimize documents like the OP's example document, i.e. documents containing copies of exactly identical objects, in the case at hand completely identical, fully embedded fonts. It does not merge merely nearly identical objects, e.g. multiple subsets of the same font into one single union subset.



    In the course of comments to the questions it became clear that the duplicate fonts in the OP's PDF indeed were identical full copies of a source font file. To merge such duplicate objects, one has to collect the complex objects (arrays, dictionaries, streams) of a document, compare them with each other, and then merge duplicates.



    As actual pairwise comparison of all complex objects of a document can take too much time in case of large documents, the following code calculates a hash of these objects and only compares objects with identical hash.



    To merge duplicates, the code selects one of the duplicates and replaces all references to any of the other duplicates with a reference to the chosen one, removing the other duplicates from the document object pool. To do this more effectively, the code initially not only collects all complex objects but also all references to each of them.



    The optimization code



    This is the method to call to optimize a PDDocument:



    public void optimize(PDDocument pdDocument) throws IOException {
    Map<COSBase, Collection<Reference>> complexObjects = findComplexObjects(pdDocument);
    for (int pass = 0; ; pass++) {
    int merges = mergeDuplicates(complexObjects);
    if (merges <= 0) {
    System.out.printf("Pass %d - No merged objectsnn", pass);
    break;
    }
    System.out.printf("Pass %d - Merged objects: %dnn", pass, merges);
    }
    }


    (OptimizeAfterMerge method under test)



    The optimization takes multiple passes as the equality of some objects can only be recognized after duplicates they reference have been merged.



    The following helper methods and classes collect the complex objects of a PDF and the references to each of them:



    Map<COSBase, Collection<Reference>> findComplexObjects(PDDocument pdDocument) {
    COSDictionary catalogDictionary = pdDocument.getDocumentCatalog().getCOSObject();
    Map<COSBase, Collection<Reference>> incomingReferences = new HashMap<>();
    incomingReferences.put(catalogDictionary, new ArrayList<>());

    Set<COSBase> lastPass = Collections.<COSBase>singleton(catalogDictionary);
    Set<COSBase> thisPass = new HashSet<>();
    while(!lastPass.isEmpty()) {
    for (COSBase object : lastPass) {
    if (object instanceof COSArray) {
    COSArray array = (COSArray) object;
    for (int i = 0; i < array.size(); i++) {
    addTarget(new ArrayReference(array, i), incomingReferences, thisPass);
    }
    } else if (object instanceof COSDictionary) {
    COSDictionary dictionary = (COSDictionary) object;
    for (COSName key : dictionary.keySet()) {
    addTarget(new DictionaryReference(dictionary, key), incomingReferences, thisPass);
    }
    }
    }
    lastPass = thisPass;
    thisPass = new HashSet<>();
    }
    return incomingReferences;
    }

    void addTarget(Reference reference, Map<COSBase, Collection<Reference>> incomingReferences, Set<COSBase> thisPass) {
    COSBase object = reference.getTo();
    if (object instanceof COSArray || object instanceof COSDictionary) {
    Collection<Reference> incoming = incomingReferences.get(object);
    if (incoming == null) {
    incoming = new ArrayList<>();
    incomingReferences.put(object, incoming);
    thisPass.add(object);
    }
    incoming.add(reference);
    }
    }


    (OptimizeAfterMerge helper methods findComplexObjects and addTarget)



    interface Reference {
    public COSBase getFrom();

    public COSBase getTo();
    public void setTo(COSBase to);
    }

    static class ArrayReference implements Reference {
    public ArrayReference(COSArray array, int index) {
    this.from = array;
    this.index = index;
    }

    @Override
    public COSBase getFrom() {
    return from;
    }

    @Override
    public COSBase getTo() {
    return resolve(from.get(index));
    }

    @Override
    public void setTo(COSBase to) {
    from.set(index, to);
    }

    final COSArray from;
    final int index;
    }

    static class DictionaryReference implements Reference {
    public DictionaryReference(COSDictionary dictionary, COSName key) {
    this.from = dictionary;
    this.key = key;
    }

    @Override
    public COSBase getFrom() {
    return from;
    }

    @Override
    public COSBase getTo() {
    return resolve(from.getDictionaryObject(key));
    }

    @Override
    public void setTo(COSBase to) {
    from.setItem(key, to);
    }

    final COSDictionary from;
    final COSName key;
    }


    (OptimizeAfterMerge helper interface Reference with implementations ArrayReference and DictionaryReference)



    And the following helper methods and classes finally identify and merge duplicates:



    int mergeDuplicates(Map<COSBase, Collection<Reference>> complexObjects) throws IOException {
    List<HashOfCOSBase> hashes = new ArrayList<>(complexObjects.size());
    for (COSBase object : complexObjects.keySet()) {
    hashes.add(new HashOfCOSBase(object));
    }
    Collections.sort(hashes);

    int removedDuplicates = 0;
    if (!hashes.isEmpty()) {
    int runStart = 0;
    int runHash = hashes.get(0).hash;
    for (int i = 1; i < hashes.size(); i++) {
    int hash = hashes.get(i).hash;
    if (hash != runHash) {
    int runSize = i - runStart;
    if (runSize != 1) {
    System.out.printf("Equal hash %d for %d elements.n", runHash, runSize);
    removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, i));
    }
    runHash = hash;
    runStart = i;
    }
    }
    int runSize = hashes.size() - runStart;
    if (runSize != 1) {
    System.out.printf("Equal hash %d for %d elements.n", runHash, runSize);
    removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, hashes.size()));
    }
    }
    return removedDuplicates;
    }

    int mergeRun(Map<COSBase, Collection<Reference>> complexObjects, List<HashOfCOSBase> run) {
    int removedDuplicates = 0;

    List<List<COSBase>> duplicateSets = new ArrayList<>();
    for (HashOfCOSBase entry : run) {
    COSBase element = entry.object;
    for (List<COSBase> duplicateSet : duplicateSets) {
    if (equals(element, duplicateSet.get(0))) {
    duplicateSet.add(element);
    element = null;
    break;
    }
    }
    if (element != null) {
    List<COSBase> duplicateSet = new ArrayList<>();
    duplicateSet.add(element);
    duplicateSets.add(duplicateSet);
    }
    }

    System.out.printf("Identified %d set(s) of identical objects in run.n", duplicateSets.size());

    for (List<COSBase> duplicateSet : duplicateSets) {
    if (duplicateSet.size() > 1) {
    COSBase surviver = duplicateSet.remove(0);
    Collection<Reference> surviverReferences = complexObjects.get(surviver);
    for (COSBase object : duplicateSet) {
    Collection<Reference> references = complexObjects.get(object);
    for (Reference reference : references) {
    reference.setTo(surviver);
    surviverReferences.add(reference);
    }
    complexObjects.remove(object);
    removedDuplicates++;
    }
    surviver.setDirect(false);
    }
    }

    return removedDuplicates;
    }

    boolean equals(COSBase a, COSBase b) {
    if (a instanceof COSArray) {
    if (b instanceof COSArray) {
    COSArray aArray = (COSArray) a;
    COSArray bArray = (COSArray) b;
    if (aArray.size() == bArray.size()) {
    for (int i=0; i < aArray.size(); i++) {
    if (!resolve(aArray.get(i)).equals(resolve(bArray.get(i))))
    return false;
    }
    return true;
    }
    }
    } else if (a instanceof COSDictionary) {
    if (b instanceof COSDictionary) {
    COSDictionary aDict = (COSDictionary) a;
    COSDictionary bDict = (COSDictionary) b;
    Set<COSName> keys = aDict.keySet();
    if (keys.equals(bDict.keySet())) {
    for (COSName key : keys) {
    if (!resolve(aDict.getItem(key)).equals(bDict.getItem(key)))
    return false;
    }
    // In case of COSStreams we strictly speaking should
    // also compare the stream contents here. But apparently
    // their hashes coincide well enough for the original
    // hashing equality, so let's just assume...
    return true;
    }
    }
    }
    return false;
    }

    static COSBase resolve(COSBase object) {
    while (object instanceof COSObject)
    object = ((COSObject)object).getObject();
    return object;
    }


    (OptimizeAfterMerge helper methods mergeDuplicates, mergeRun, equals, and resolve)



    // Requires java.io.IOException, java.io.InputStream, java.security.MessageDigest,
    // java.security.NoSuchAlgorithmException, java.util.Arrays, and java.util.Map.
    static class HashOfCOSBase implements Comparable<HashOfCOSBase> {
        public HashOfCOSBase(COSBase object) throws IOException {
            this.object = object;
            this.hash = calculateHash(object);
        }

        int calculateHash(COSBase object) throws IOException {
            if (object instanceof COSArray) {
                // Order-sensitive hash over the array members
                int result = 1;
                for (COSBase member : (COSArray) object)
                    result = 31 * result + member.hashCode();
                return result;
            } else if (object instanceof COSDictionary) {
                // Order-insensitive hash over the dictionary entries
                int result = 3;
                for (Map.Entry<COSName, COSBase> entry : ((COSDictionary) object).entrySet())
                    result += entry.hashCode();
                if (object instanceof COSStream) {
                    // For streams, also mix in an MD5 digest of the raw stream data
                    try (InputStream data = ((COSStream) object).createRawInputStream()) {
                        MessageDigest md = MessageDigest.getInstance("MD5");
                        byte[] buffer = new byte[8192];
                        int bytesRead = 0;
                        while ((bytesRead = data.read(buffer)) >= 0)
                            md.update(buffer, 0, bytesRead);
                        result = 31 * result + Arrays.hashCode(md.digest());
                    } catch (NoSuchAlgorithmException e) {
                        throw new IOException(e);
                    }
                }
                return result;
            } else {
                throw new IllegalArgumentException(String.format("Unknown complex COSBase type %s", object.getClass().getName()));
            }
        }

        final COSBase object;
        final int hash;

        @Override
        public int compareTo(HashOfCOSBase o) {
            // Sort by content hash first; the object hash code only breaks ties
            int result = Integer.compare(hash, o.hash);
            if (result == 0)
                result = Integer.compare(hashCode(), o.hashCode());
            return result;
        }
    }


    (OptimizeAfterMerge helper class HashOfCOSBase)
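
The stream part of the hash above can be tried out without PDFBox. The following is only a minimal sketch of that one step: the class name StreamHashDemo and its methods hashStream and hashBytes are made up for this illustration, and a plain byte array stands in for the raw data of a COSStream. Unlike the answer's code it wraps checked exceptions in unchecked ones, purely to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical stand-alone illustration; not part of the optimizer.
final class StreamHashDemo {
    // Reads the stream in 8 KB chunks, digests it with MD5 and folds
    // the digest into an int hash that starts from the given seed.
    static int hashStream(InputStream data, int seed) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = data.read(buffer)) >= 0) {
                md.update(buffer, 0, bytesRead);
            }
            return 31 * seed + Arrays.hashCode(md.digest());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience wrapper: a byte array plays the role of the stream data.
    static int hashBytes(byte[] bytes, int seed) {
        return hashStream(new ByteArrayInputStream(bytes), seed);
    }

    public static void main(String[] args) {
        byte[] first = "font program".getBytes(StandardCharsets.UTF_8);
        byte[] copy = first.clone();
        // Identical data yields identical hashes
        System.out.println(hashBytes(first, 3) == hashBytes(copy, 3)); // prints "true"
    }
}
```

Two streams with the same bytes always end up with the same hash, which is what lets identical embedded font files land in the same run of candidates. Different bytes can still collide in a 32-bit hash, so the hash only narrows down the candidates.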



    Applying the code to the OP's example document



    The OP's example document is about 6.5 MB in size. Applying the above code like this



    PDDocument pdDocument = PDDocument.load(SOURCE);
    optimize(pdDocument);
    pdDocument.save(RESULT);


    results in a PDF less than 700 KB in size, and it appears to be complete.



    (If something's missing, please tell me and I'll try to fix it.)



    Words of warning



    On the one hand, this optimizer will not recognize all identical duplicates. In particular, duplicate cycles of objects (circular references) won't be recognized: the code only treats two objects as duplicates if their contents are identical, and the members of one cycle reference each other rather than their counterparts in the other cycle, so their contents usually do not match.
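
The cycle limitation can be reproduced with a toy model. This is a sketch under assumptions, not PDFBox code: the class CycleDemo, its Node type with a single next reference, and the helper names are invented for the illustration, and shallowEquals only mimics the idea of comparing referenced objects by identity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model; names are made up for this illustration.
final class CycleDemo {
    static final class Node {
        final String label;
        Node next; // single outgoing reference, compared by identity

        Node(String label) {
            this.label = label;
        }
    }

    // Two nodes count as duplicates only if they reference the very same target.
    static boolean shallowEquals(Node a, Node b) {
        return a.label.equals(b.label) && a.next == b.next;
    }

    static int countDuplicatePairs(List<Node> nodes) {
        int pairs = 0;
        for (int i = 0; i < nodes.size(); i++)
            for (int j = i + 1; j < nodes.size(); j++)
                if (shallowEquals(nodes.get(i), nodes.get(j)))
                    pairs++;
        return pairs;
    }

    // a1 <-> b1 and a2 <-> b2: structurally identical, but no shared target
    static List<Node> twoIdenticalCycles() {
        Node a1 = new Node("a"), b1 = new Node("b");
        Node a2 = new Node("a"), b2 = new Node("b");
        a1.next = b1; b1.next = a1;
        a2.next = b2; b2.next = a2;
        return Arrays.asList(a1, b1, a2, b2);
    }

    // x1 -> tail and x2 -> tail: both reference the same object
    static List<Node> twoChainsWithSharedTail() {
        Node tail = new Node("tail");
        Node x1 = new Node("x"), x2 = new Node("x");
        x1.next = tail; x2.next = tail;
        return Arrays.asList(x1, x2, tail);
    }

    public static void main(String[] args) {
        System.out.println(countDuplicatePairs(twoIdenticalCycles()));      // prints "0"
        System.out.println(countDuplicatePairs(twoChainsWithSharedTail())); // prints "1"
    }
}
```

In the cycle case there is no pair to start from, so no number of passes helps; in the shared-tail case the duplicates are found immediately.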



    On the other hand, this optimizer might already be overly eager in some cases, because some duplicates might need to remain separate objects for PDF viewers to accept each instance as an individual entity.



    Furthermore, this program touches all kinds of objects in the file, even those defining the inner structures of the PDF, but it does not attempt to update any PDFBox classes managing this structure (PDDocument, PDDocumentCatalog, PDAcroForm, ...). To keep pending changes from corrupting the document, apply this program only to freshly loaded, unmodified PDDocument instances, and save the result right afterwards.
































    • Thank you, the code seems to do the job, too. As I was forced to provide a solution very quickly last week, I chose to adopt schowave's solution, as his few-lines solution worked for me and my PDFs. BTW: in the meantime I found a working solution using iText 7; I'm going to post a code example tomorrow.

      – hab
      Dec 5 '18 at 20:14
















    The code in this answer is an attempt to optimize documents like the OP's example document, i.e. documents containing copies of exactly identical objects, in the case at hand completely identical, fully embedded fonts. It does not merge merely nearly identical objects, e.g. multiple subsets of the same font into one single union subset.



    In the course of the comments on the question it became clear that the duplicate fonts in the OP's PDF were indeed identical full copies of a source font file. To merge such duplicate objects, one has to collect the complex objects (arrays, dictionaries, streams) of a document, compare them with each other, and then merge duplicates.



    As actual pairwise comparison of all complex objects of a document can take too much time in case of large documents, the following code calculates a hash of these objects and only compares objects with identical hash.
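
The idea of hashing first and doing the expensive comparison only among objects with equal hashes can be sketched independently of PDFBox. The following is a minimal illustration, not the optimizer itself: the class HashBucketDemo is made up for the example, int arrays stand in for PDF objects, and a map of buckets replaces the sort-and-scan over runs that the actual code uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone illustration; not part of the optimizer.
final class HashBucketDemo {
    // Groups equal arrays; contents are compared only within one hash bucket.
    static List<List<int[]>> groupDuplicates(List<int[]> items) {
        Map<Integer, List<int[]>> buckets = new LinkedHashMap<>();
        for (int[] item : items) {
            buckets.computeIfAbsent(Arrays.hashCode(item), k -> new ArrayList<>()).add(item);
        }

        List<List<int[]>> groups = new ArrayList<>();
        for (List<int[]> bucket : buckets.values()) {
            // Equal hashes do not guarantee equal contents, so compare for real here
            List<List<int[]>> bucketGroups = new ArrayList<>();
            for (int[] item : bucket) {
                List<int[]> match = null;
                for (List<int[]> group : bucketGroups) {
                    if (Arrays.equals(item, group.get(0))) {
                        match = group;
                        break;
                    }
                }
                if (match == null) {
                    match = new ArrayList<>();
                    bucketGroups.add(match);
                }
                match.add(item);
            }
            groups.addAll(bucketGroups);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> items = Arrays.asList(new int[]{1, 2, 3}, new int[]{4, 5}, new int[]{1, 2, 3});
        for (List<int[]> group : groupDuplicates(items)) {
            System.out.println(Arrays.toString(group.get(0)) + " x" + group.size());
        }
    }
}
```

With n objects spread over many buckets, the number of full comparisons drops from roughly n²/2 to the sum over the (typically small) buckets.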



    To merge duplicates, the code selects one of the duplicates and replaces all references to any of the other duplicates with a reference to the chosen one, removing the other duplicates from the document object pool. To do this more effectively, the code initially not only collects all complex objects but also all references to each of them.



    The optimization code



    This is the method to call to optimize a PDDocument:



    public void optimize(PDDocument pdDocument) throws IOException {
        Map<COSBase, Collection<Reference>> complexObjects = findComplexObjects(pdDocument);
        // Repeat until a pass no longer merges anything
        for (int pass = 0; ; pass++) {
            int merges = mergeDuplicates(complexObjects);
            if (merges <= 0) {
                System.out.printf("Pass %d - No merged objects\n\n", pass);
                break;
            }
            System.out.printf("Pass %d - Merged objects: %d\n\n", pass, merges);
        }
    }


    (OptimizeAfterMerge method under test)



    The optimization takes multiple passes as the equality of some objects can only be recognized after duplicates they reference have been merged.
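
Why several passes are needed can be seen in a toy model. This is a sketch under assumptions rather than the real algorithm: MultiPassDemo, its Node type, and the labels are invented for the illustration, and shallowEquals compares children by identity to mimic comparing references to already collected objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; names are made up for this illustration.
final class MultiPassDemo {
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
    }

    // Same label and the very same child instances
    static boolean shallowEquals(Node a, Node b) {
        if (!a.label.equals(b.label) || a.children.size() != b.children.size())
            return false;
        for (int i = 0; i < a.children.size(); i++)
            if (a.children.get(i) != b.children.get(i))
                return false;
        return true;
    }

    // One pass: pick survivors first, then redirect references to them.
    static int mergePass(List<Node> nodes) {
        Map<Node, Node> survivorOf = new IdentityHashMap<>();
        List<Node> survivors = new ArrayList<>();
        for (Node node : nodes) {
            Node survivor = node;
            for (Node candidate : survivors) {
                if (shallowEquals(candidate, node)) {
                    survivor = candidate;
                    break;
                }
            }
            if (survivor == node)
                survivors.add(node);
            survivorOf.put(node, survivor);
        }
        for (Node node : survivors)
            node.children.replaceAll(survivorOf::get);
        int merged = nodes.size() - survivors.size();
        nodes.clear();
        nodes.addAll(survivors);
        return merged;
    }

    // Runs passes until nothing is merged; returns the merge count of each pass.
    static List<Integer> optimize(List<Node> nodes) {
        List<Integer> mergesPerPass = new ArrayList<>();
        int merged;
        do {
            merged = mergePass(nodes);
            mergesPerPass.add(merged);
        } while (merged > 0);
        return mergesPerPass;
    }

    // Two "pages", each referencing its own copy of an identical "font"
    static List<Node> demoGraph() {
        Node fontA = new Node("font"), fontB = new Node("font");
        Node pageA = new Node("page", fontA), pageB = new Node("page", fontB);
        Node root = new Node("root", pageA, pageB);
        return new ArrayList<>(Arrays.asList(root, pageA, pageB, fontA, fontB));
    }

    public static void main(String[] args) {
        System.out.println("Merges per pass: " + optimize(demoGraph())); // prints "Merges per pass: [1, 1, 0]"
    }
}
```

In the first pass only the two leaf nodes are equal; the two parents become equal, and are merged in the second pass, only once both reference the surviving leaf.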



    The following helper methods and classes collect the complex objects of a PDF and the references to each of them:



    Map<COSBase, Collection<Reference>> findComplexObjects(PDDocument pdDocument) {
        COSDictionary catalogDictionary = pdDocument.getDocumentCatalog().getCOSObject();
        Map<COSBase, Collection<Reference>> incomingReferences = new HashMap<>();
        incomingReferences.put(catalogDictionary, new ArrayList<>());

        Set<COSBase> lastPass = Collections.<COSBase>singleton(catalogDictionary);
        Set<COSBase> thisPass = new HashSet<>();
        while (!lastPass.isEmpty()) {
            for (COSBase object : lastPass) {
                if (object instanceof COSArray) {
                    COSArray array = (COSArray) object;
                    for (int i = 0; i < array.size(); i++) {
                        addTarget(new ArrayReference(array, i), incomingReferences, thisPass);
                    }
                } else if (object instanceof COSDictionary) {
                    COSDictionary dictionary = (COSDictionary) object;
                    for (COSName key : dictionary.keySet()) {
                        addTarget(new DictionaryReference(dictionary, key), incomingReferences, thisPass);
                    }
                }
            }
            lastPass = thisPass;
            thisPass = new HashSet<>();
        }
        return incomingReferences;
    }

    void addTarget(Reference reference, Map<COSBase, Collection<Reference>> incomingReferences, Set<COSBase> thisPass) {
        COSBase object = reference.getTo();
        if (object instanceof COSArray || object instanceof COSDictionary) {
            Collection<Reference> incoming = incomingReferences.get(object);
            if (incoming == null) {
                incoming = new ArrayList<>();
                incomingReferences.put(object, incoming);
                thisPass.add(object);
            }
            incoming.add(reference);
        }
    }


    (OptimizeAfterMerge helper methods findComplexObjects and addTarget)



    interface Reference {
        public COSBase getFrom();

        public COSBase getTo();
        public void setTo(COSBase to);
    }

    static class ArrayReference implements Reference {
        public ArrayReference(COSArray array, int index) {
            this.from = array;
            this.index = index;
        }

        @Override
        public COSBase getFrom() {
            return from;
        }

        @Override
        public COSBase getTo() {
            return resolve(from.get(index));
        }

        @Override
        public void setTo(COSBase to) {
            from.set(index, to);
        }

        final COSArray from;
        final int index;
    }

    static class DictionaryReference implements Reference {
        public DictionaryReference(COSDictionary dictionary, COSName key) {
            this.from = dictionary;
            this.key = key;
        }

        @Override
        public COSBase getFrom() {
            return from;
        }

        @Override
        public COSBase getTo() {
            return resolve(from.getDictionaryObject(key));
        }

        @Override
        public void setTo(COSBase to) {
            from.setItem(key, to);
        }

        final COSDictionary from;
        final COSName key;
    }


    (OptimizeAfterMerge helper interface Reference with implementations ArrayReference and DictionaryReference)



    And the following helper methods and classes finally identify and merge duplicates:



    int mergeDuplicates(Map<COSBase, Collection<Reference>> complexObjects) throws IOException {
        List<HashOfCOSBase> hashes = new ArrayList<>(complexObjects.size());
        for (COSBase object : complexObjects.keySet()) {
            hashes.add(new HashOfCOSBase(object));
        }
        // After sorting, objects with equal hashes form consecutive runs
        Collections.sort(hashes);

        int removedDuplicates = 0;
        if (!hashes.isEmpty()) {
            int runStart = 0;
            int runHash = hashes.get(0).hash;
            for (int i = 1; i < hashes.size(); i++) {
                int hash = hashes.get(i).hash;
                if (hash != runHash) {
                    int runSize = i - runStart;
                    if (runSize != 1) {
                        System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
                        removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, i));
                    }
                    runHash = hash;
                    runStart = i;
                }
            }
            // Handle the final run
            int runSize = hashes.size() - runStart;
            if (runSize != 1) {
                System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
                removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, hashes.size()));
            }
        }
        return removedDuplicates;
    }

    int mergeRun(Map<COSBase, Collection<Reference>> complexObjects, List<HashOfCOSBase> run) {
    int removedDuplicates = 0;

    List<List<COSBase>> duplicateSets = new ArrayList<>();
    for (HashOfCOSBase entry : run) {
    COSBase element = entry.object;
    for (List<COSBase> duplicateSet : duplicateSets) {
    if (equals(element, duplicateSet.get(0))) {
    duplicateSet.add(element);
    element = null;
    break;
    }
    }
    if (element != null) {
    List<COSBase> duplicateSet = new ArrayList<>();
    duplicateSet.add(element);
    duplicateSets.add(duplicateSet);
    }
    }

    System.out.printf("Identified %d set(s) of identical objects in run.n", duplicateSets.size());

    for (List<COSBase> duplicateSet : duplicateSets) {
    if (duplicateSet.size() > 1) {
    COSBase surviver = duplicateSet.remove(0);
    Collection<Reference> surviverReferences = complexObjects.get(surviver);
    for (COSBase object : duplicateSet) {
    Collection<Reference> references = complexObjects.get(object);
    for (Reference reference : references) {
    reference.setTo(surviver);
    surviverReferences.add(reference);
    }
    complexObjects.remove(object);
    removedDuplicates++;
    }
    surviver.setDirect(false);
    }
    }

    return removedDuplicates;
    }

    boolean equals(COSBase a, COSBase b) {
    if (a instanceof COSArray) {
    if (b instanceof COSArray) {
    COSArray aArray = (COSArray) a;
    COSArray bArray = (COSArray) b;
    if (aArray.size() == bArray.size()) {
    for (int i=0; i < aArray.size(); i++) {
    if (!resolve(aArray.get(i)).equals(resolve(bArray.get(i))))
    return false;
    }
    return true;
    }
    }
    } else if (a instanceof COSDictionary) {
    if (b instanceof COSDictionary) {
    COSDictionary aDict = (COSDictionary) a;
    COSDictionary bDict = (COSDictionary) b;
    Set<COSName> keys = aDict.keySet();
    if (keys.equals(bDict.keySet())) {
    for (COSName key : keys) {
    if (!resolve(aDict.getItem(key)).equals(bDict.getItem(key)))
    return false;
    }
    // In case of COSStreams we strictly speaking should
    // also compare the stream contents here. But apparently
    // their hashes coincide well enough for the original
    // hashing equality, so let's just assume...
    return true;
    }
    }
    }
    return false;
    }

    static COSBase resolve(COSBase object) {
    while (object instanceof COSObject)
    object = ((COSObject)object).getObject();
    return object;
    }


    (OptimizeAfterMerge helper methods mergeDuplicates, mergeRun, equals, and resolve)
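The core idea of mergeDuplicates and mergeRun (sort by hash, walk the runs of equal hash, then confirm real equality inside each run) can be tried out without PDFBox. The following is only a minimal sketch on plain Java objects; the class name DuplicateRuns and its methods are made up for illustration and are not part of the optimizer above.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the hash-run-then-verify idea; no PDFBox types involved. */
final class DuplicateRuns {

    /** Sorts by hash, cuts the sorted list into runs of equal hash, verifies each run with equals(). */
    static <T> List<List<T>> duplicateSets(List<T> items) {
        List<T> sorted = new ArrayList<>(items);
        sorted.sort((a, b) -> Integer.compare(a.hashCode(), b.hashCode()));

        List<List<T>> result = new ArrayList<>();
        int runStart = 0;
        for (int i = 1; i <= sorted.size(); i++) {
            boolean runEnds = i == sorted.size()
                    || sorted.get(i).hashCode() != sorted.get(runStart).hashCode();
            if (runEnds) {
                result.addAll(splitRun(sorted.subList(runStart, i)));
                runStart = i;
            }
        }
        return result;
    }

    /** Equal hashes may still be collisions, so elements of one run are compared pairwise. */
    private static <T> List<List<T>> splitRun(List<T> run) {
        List<List<T>> sets = new ArrayList<>();
        for (T element : run) {
            List<T> home = null;
            for (List<T> set : sets) {
                if (set.get(0).equals(element)) {
                    home = set;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                sets.add(home);
            }
            home.add(element);
        }
        return sets;
    }
}
```

For example, the strings "Aa" and "BB" have the same hashCode, so they land in the same run, but the equality check still keeps them in separate sets. Sorting first means only objects with equal hashes are ever compared with each other.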



    static class HashOfCOSBase implements Comparable<HashOfCOSBase> {
        public HashOfCOSBase(COSBase object) throws IOException {
            this.object = object;
            this.hash = calculateHash(object);
        }

        int calculateHash(COSBase object) throws IOException {
            if (object instanceof COSArray) {
                int result = 1;
                for (COSBase member : (COSArray)object)
                    result = 31 * result + member.hashCode();
                return result;
            } else if (object instanceof COSDictionary) {
                int result = 3;
                for (Map.Entry<COSName, COSBase> entry : ((COSDictionary)object).entrySet())
                    result += entry.hashCode();
                if (object instanceof COSStream) {
                    // For streams, also mix a digest of the raw (still encoded) content into the hash.
                    try ( InputStream data = ((COSStream)object).createRawInputStream() ) {
                        MessageDigest md = MessageDigest.getInstance("MD5");
                        byte[] buffer = new byte[8192];
                        int bytesRead = 0;
                        while((bytesRead = data.read(buffer)) >= 0)
                            md.update(buffer, 0, bytesRead);
                        result = 31 * result + Arrays.hashCode(md.digest());
                    } catch (NoSuchAlgorithmException e) {
                        throw new IOException(e);
                    }
                }
                return result;
            } else {
                throw new IllegalArgumentException(String.format("Unknown complex COSBase type %s", object.getClass().getName()));
            }
        }

        final COSBase object;
        final int hash;

        @Override
        public int compareTo(HashOfCOSBase o) {
            int result = Integer.compare(hash, o.hash);
            if (result == 0)
                result = Integer.compare(hashCode(), o.hashCode());
            return result;
        }
    }


    (OptimizeAfterMerge helper class HashOfCOSBase)
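The stream part of calculateHash boils down to digesting the stream bytes chunk by chunk and folding the digest into an int. Here is a small stand-alone sketch of just that step, assuming plain java.io streams instead of COSStream; the class StreamHash and its method names are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Sketch of the content hashing step; works on any InputStream, not only PDF streams. */
final class StreamHash {

    /** Reads the stream in 8 KB chunks, feeds them to MD5, and folds the digest into an int. */
    static int contentHash(InputStream data) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = data.read(buffer)) >= 0) {
                md.update(buffer, 0, bytesRead);
            }
            return Arrays.hashCode(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }

    /** Convenience overload for in-memory data. */
    static int contentHash(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return contentHash(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

MD5 is used here only as a fingerprint for grouping candidates, not for security. Since the digest is reduced to 32 bits, different contents can still end up with the same hash.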



    Applying the code to the OP's example document



    The OP's example document is about 6.5 MB in size. Applying the above code like this



    try (PDDocument pdDocument = PDDocument.load(SOURCE)) {
        optimize(pdDocument);
        pdDocument.save(RESULT);
    }


    results in a PDF less than 700 KB in size, and it appears to be complete.



    (If something's missing, please tell me and I'll try to fix it.)



    Words of warning



    On the one hand, this optimizer will not recognize all identical duplicates. In particular, duplicate cycles of objects (circular references) won't be recognized, because the code only treats objects as duplicates if their contents are identical, which usually is not the case for the members of duplicate cycles.
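To illustrate the limitation with cycles, here is a toy model that is not PDFBox code: a Node stands in for a dictionary, and nested nodes only count as equal if they are the very same instance (that identity comparison is an assumption of this sketch). All names are invented for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of content-based duplicate detection; Node is compared by identity. */
final class CycleDemo {

    static final class Node {
        final Map<String, Object> entries = new HashMap<>();
    }

    /** Same keys and equal values; a nested Node only equals itself. */
    static boolean shallowEquals(Node a, Node b) {
        return a.entries.equals(b.entries);
    }

    /** Two structurally identical two-node cycles: no pair of nodes compares equal. */
    static boolean cyclesLookEqual() {
        Node a1 = new Node(), b1 = new Node(), a2 = new Node(), b2 = new Node();
        a1.entries.put("Type", "A"); a1.entries.put("Next", b1);
        b1.entries.put("Type", "B"); b1.entries.put("Next", a1);
        a2.entries.put("Type", "A"); a2.entries.put("Next", b2);
        b2.entries.put("Type", "B"); b2.entries.put("Next", a2);
        return shallowEquals(a1, a2) || shallowEquals(b1, b2);
    }

    /** Acyclic duplicates: equal leaves are merged first, then the parents compare equal. */
    static boolean parentsEqualAfterLeafMerge() {
        Node leaf1 = new Node(), leaf2 = new Node(), parent1 = new Node(), parent2 = new Node();
        leaf1.entries.put("Type", "Font");
        leaf2.entries.put("Type", "Font");
        parent1.entries.put("F1", leaf1);
        parent2.entries.put("F1", leaf2);
        boolean before = shallowEquals(parent1, parent2); // false: different leaf instances
        if (shallowEquals(leaf1, leaf2)) {
            parent2.entries.put("F1", leaf1);             // redirect to the surviving leaf
        }
        return !before && shallowEquals(parent1, parent2);
    }
}
```

In the acyclic case there is always a leaf to start merging from, and each merge can make the next level up identical. In a cycle there is no such starting point, so in this model neither member of the duplicate cycle ever becomes equal to its twin.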



    On the other hand, this optimizer might already be overly eager in some cases, because some duplicates might be needed as separate objects for PDF viewers to accept each instance as an individual entity.



    Furthermore, this program touches all kinds of objects in the file, even those defining the inner structure of the PDF, but it does not attempt to update any PDFBox classes managing this structure (PDDocument, PDDocumentCatalog, PDAcroForm, ...). To keep pending changes from corrupting the document, apply this program only to freshly loaded, unmodified PDDocument instances, and save the result right afterwards.















    edited Mar 8 at 14:28

























    answered Nov 29 '18 at 17:18









    mkl













    • Thank you, the code seems to do the job, too. As I was forced to provide a solution very quickly last week, I chose to adopt the solution of schowave, as his few-lines solution worked for me and my PDFs. BTW: In the meantime I found a working solution using iText 7; I'm going to post a code example tomorrow.

      – hab
      Dec 5 '18 at 20:14












































    Another way I found is to use iText 7 with smart mode enabled (pdfWriter.setSmartMode):



    try (PdfWriter pdfWriter = new PdfWriter(out)) {
        pdfWriter.setSmartMode(true); // This is where the optimization happens, e.g. redundantly embedded fonts are reduced
        pdfWriter.setCompressionLevel(Deflater.BEST_COMPRESSION);
        try (PdfDocument pdfDoc = new PdfADocument(pdfWriter, PdfAConformanceLevel.PDF_A_1B,
                new PdfOutputIntent("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", colorProfile))) {
            PdfMerger merger = new PdfMerger(pdfDoc);
            merger.setCloseSourceDocuments(true);
            try {
                for (InputStream pdf : pdfs) {
                    try (PdfDocument doc = new PdfDocument(new PdfReader(pdf))) {
                        // createPageList is a helper of my own (not shown) returning the page numbers to merge
                        merger.merge(doc, createPageList(doc.getNumberOfPages()));
                    }
                }
                merger.close();
            }
            catch (com.itextpdf.kernel.crypto.BadPasswordException e) {
                throw new BieneException("Cannot concatenate a password-protected PDF document: " + e.getMessage(),
                        e);
            }
            catch (com.itextpdf.io.IOException | PdfException e) {
                throw new BieneException(e.getMessage(), e);
            }
        }
    }
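The createPageList helper is not shown in the answer. A plausible version, assuming it simply lists all pages of the source document as 1-based page numbers (the form PdfMerger.merge(PdfDocument, List<Integer>) takes), might look like the sketch below. The wrapper class PageLists is only there to make the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reconstruction of the helper used above; the original is not shown. */
final class PageLists {

    /** Returns the 1-based page numbers 1..numberOfPages. */
    static List<Integer> createPageList(int numberOfPages) {
        List<Integer> pages = new ArrayList<>(Math.max(0, numberOfPages));
        for (int page = 1; page <= numberOfPages; page++) {
            pages.add(page);
        }
        return pages;
    }
}
```

If all pages are wanted anyway, the range overload merger.merge(doc, 1, doc.getNumberOfPages()) should make such a helper unnecessary.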















        answered Jan 29 at 20:41









        hab





























