1 2012-02-16 09:13:58 (edited by jamjam 2012-03-09 10:31:10)

CDGroup merges related disc images to reduce combined file size.

This program takes a directory of images, processing them to remove redundancies to save space. It does this in four stages:

  • Stage 1: Split CD images (2352 byte/sector images) into their component parts. Model non-data parts where possible (ecc, edc, etc)

  • Stage 2: Group each data sector size to remove repeated aligned data (de-duplicate sectors)

  • Stage 3: Diff grouped files to remove repeated non-aligned data

  • Stage 4: Compress resulting files

Notes:

  • Stage 1 takes advantage of redundancies in the CD format

  • Stage 2 takes advantage of aligned redundancies across related images

  • Stage 3 takes advantage of non-aligned redundancies across related images

  • Stage 4 takes advantage of uncompressed data in the images

  • Each stage acts on the output of the previous stage
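
To make Stage 1 a little more concrete, here is a rough sketch of splitting one raw 2352-byte Mode 1 sector into its fields, using the standard CD-ROM Mode 1 layout (12 sync, 4 header, 2048 data, 4 EDC, 8 reserved, 276 ECC). This is not CDGroup's actual code: the function name split_mode1_sector is made up for illustration, and it only slices the sector. It does not check whether the EDC/ECC could be regenerated losslessly.

```python
SECTOR_SIZE = 2352
# Standard 12-byte sync pattern at the start of a data sector.
SYNC = bytes([0x00] + [0xFF] * 10 + [0x00])


def split_mode1_sector(sector: bytes) -> dict:
    """Slice a raw Mode 1 sector into its fields (hypothetical helper)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected a 2352-byte raw sector")
    if sector[:12] != SYNC:
        raise ValueError("no sync pattern, probably an audio sector")
    if sector[15] != 1:
        raise ValueError("not a Mode 1 sector")
    return {
        "sync": sector[0:12],
        "header": sector[12:16],      # minute, second, frame + mode byte
        "data": sector[16:2064],      # the 2048 bytes an .iso would contain
        "edc": sector[2064:2068],
        "reserved": sector[2068:2076],
        "ecc": sector[2076:2352],
    }


# Dummy sector: valid sync and mode byte, everything else zeroed.
dummy = SYNC + bytes([0x00, 0x02, 0x00, 0x01]) + bytes(2048) + bytes(4 + 8 + 276)
parts = split_mode1_sector(dummy)
print({name: len(value) for name, value in parts.items()})
```

Only the "data" field needs to go forward to the grouping stage if the other fields can be modelled.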

Features:

  • Reduces storage footprint by de-duplication of sectors (unique sectors are only stored once)

  • CDs, DVDs and Blu-Rays can be merged (any .iso with 2048 bytes/sector, and CDs .bin with 2352 bytes/sector)

  • Smart handling of 2352 byte/sector CD images (recognises audio, M0, M1, M2, M2F1, M2F2 sector types. Removes non-data portions of sectors such as ecc if it can be regenerated losslessly)

  • A mixture of 2352 and 2048 byte/sector images can be merged together

  • Identical input images should result in identical merged files

  • Images represented within the merged files can be renamed without unmerging (names stored in a text file *.hsa)

  • Supports merging combined image size of up to 4TiB

  • External diff support (jojodiff)

  • External compressor / decompressor support (7z)

  • Settings chosen to try and balance size reduction with practical time / hardware constraints


Usage: Requires java to run. Can be used as supplied in three ways:

  • Use jar files directly
    java -jar CDGroup.jar dir_to_merge
    java -jar unCDGroup.jar hsa_file

  • Use bat files (windows)
    CDGroup.bat dir_to_merge
    unCDGroup.bat hsa_file

  • Use sh files (linux)
    CDGroup.sh dir_to_merge
    unCDGroup.sh hsa_file

Note: v0.4.x marks a new format, different to that of v0.3.x.

CDGroup v0.4.2

PS3Dec (decrypt ps3 images), PS3DumpCheck (check integrity), GetKey (dump PS3 metadata), DatSplit (split redump dats), GPack (compress related images together)

2 2012-02-16 20:02:07

Too bad it is java :)

3 2012-02-16 22:17:06 (edited by jamjam 2012-06-27 18:08:01)

The tools are jar files, or jar files wrapped in exes. Both require java to be installed.

In my opinion having .net dependencies or similar is much more annoying. Java is near ubiquitous. What is it you don't like about java?


4 2012-02-17 23:30:51

Jamjam, one quick question: if you have two near-identical images, but both contain large dummy files or false LBA, it's obvious it's going to match those sectors, but will it create a rather large compressible file somewhere containing all the matching dummy data or FLBA?

He who controls the SPICE... controls the UNIVERSE!
The SPICE must flow.

5 2012-02-17 23:55:57

I don't understand what this is for. A better description, please?

Plextor PX-760A 1.07 (+30) : Plextor PX-716SA 1.11 (+30) : Plextor PX-W5224A 1.04 (+30) : Plextor PX-W4824 1.07 (+30) : Plextor PX-W4012TA 1.07 (+98) : Plextor PX-W1610TA (+99) : Plextor PX-W1210TA 1.10 (+99) : Lite-On LTR-48246S (+6) : Lite-On LTR-52246S (+6) : Lite-On LH-20A1H LL0DN (+6) : BenQ DW1655 BCIB (+618) : ASUS DRW-2014L1 1.02 (+6) : Yamaha CRW-F1 (+733) : Optiarc SA-7290H5 1H44 (+48) : ASUS BW-16D1HT 3.02 (+6)

6 2012-02-18 00:21:43

@tossEAC
In a single pass, the data is split into pieces. Any identical pieces are stored only once, with another file listing all the places the piece is duplicated. Large zeroed files will be reduced to a single piece with a large list of locations.
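
A minimal sketch of that idea, assuming fixed 2048-byte pieces and a SHA-1 digest as the identity of a piece (both are assumptions for illustration, as are the names group_pieces and rebuild; none of this is CDGroup's real code or file format):

```python
import hashlib


def group_pieces(images: dict, piece_size: int = 2048):
    """Store each distinct piece once and record where every piece belongs."""
    store = []       # unique pieces, in first-seen order
    index = {}       # piece digest -> position in store
    locations = {}   # image name -> list of store positions
    for name, data in images.items():
        refs = []
        for off in range(0, len(data), piece_size):
            piece = data[off:off + piece_size]
            # Assumption: equal digests mean equal pieces. A careful tool
            # would also compare the bytes to rule out collisions.
            key = hashlib.sha1(piece).digest()
            if key not in index:
                index[key] = len(store)
                store.append(piece)
            refs.append(index[key])
        locations[name] = refs
    return store, locations


def rebuild(store, refs) -> bytes:
    """Reassemble one image from the piece store and its location list."""
    return b"".join(store[i] for i in refs)


images = {
    "usa.iso": bytes(2048) * 3 + b"A" * 2048,   # three zeroed pieces + one unique
    "eur.iso": bytes(2048) * 3 + b"B" * 2048,
}
store, locations = group_pieces(images)
print(len(store), locations)
```

Here the six zeroed pieces collapse to a single stored piece, and each image keeps only a list of positions.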

If by false lba you mean the file system describes files beyond the size limit of the image, this will have no effect as the file system is not read.

@Nexy
If you have many related images, you can merge them with this to save space. Many regional variants of the same game for example would be a good candidate for merging.


7 2012-02-24 18:23:01

New version: v0.2.10

Differences from v0.2.1:

  • Better memory usage

  • Basic support to extract from zip and 7z files when unmerging (does not handle compressed files containing directories)

  • Partial sector matching option without having to do multiple passes (not used by default)

  • Bug fixes

  • Better description :P

Note: Default behaviour has changed. There is now only one pass by default instead of three. If you want partial sector matching, it is recommended to use partial matching without multiple passes.

CDGroup v0.2.10


8 2012-02-28 03:27:33

New version, v0.3.1

  • New file type, GSA (stores metadata of original images)

  • GSN files now generic, GSA is the only place actual titles are stored

  • Improved 7z extraction, extract direct to required location

Note: The new file type means the merged files are slightly different to those made with v0.2.x. The new format makes the gsn and def files generic. Now you only need to edit the gsa file (and any cues/gdis if they exist) for renaming images, and an extra hash check is performed for validation.

This is probably the final format of the merged files. I can't see many reasons to change it from here.

CDGroup v0.3.1


9 2012-02-29 02:38:21 (edited by tossEAC 2012-02-29 09:36:40)

Resident Evil 4

---------------------------------------------------------------------

8 ngc raw isos = 10.8 GB

8 ngc pakkiso'd = 6.38 GB = 4.42 GB saved over raw iso

8 ngc raw CDGroup'd = 4.28 GB = 2.1 GB saved over pakkiso

8 ngc compress'd CDGroup'd = 3.07 GB = 1.21 GB saved over raw CDGroup

---------------------------------------------------------------------

Resident Evil 4 (Europe) (En,Fr,De,Es,It) (Disc 1).iso:1459978240:22DB91D40629DF75E63A019E0F9D72F5

Resident Evil 4 (USA) (Disc 2).iso:1459978240:2381ACD2199D6E7566932DF86901903D

Resident Evil 4 (Germany) (En,Fr,De,Es,It) (Disc 1).iso:1459978240:290D0B60E8B4D53A05C478D91A3CCDA2

Resident Evil 4 (Europe) (En,Fr,De,Es,It) (Disc 2).iso:1459978240:41A0F650DD5A80DC8D0C268037FF2320

Biohazard 4 (Japan) (Disc 2).iso:1459978240:499CBEC401C8D80D35B521B215E8D039

Biohazard 4 (Japan) (Disc 1).iso:1459978240:6B57D0AE0B872A50457276FB84D58CF2

Resident Evil 4 (Germany) (En,Fr,De,Es,It) (Disc 2).iso:1459978240:7B6C469694E9BB14EFFEC5D0017E08DD

Resident Evil 4 (USA) (Disc 1).iso:1459978240:CA749757E3B9D119F3FEB1F9F0F81BD7


10 2012-02-29 03:02:50

deletefiles=false (set in ini).

But it's still deleting them when I merge.


11 2012-02-29 08:00:26 (edited by tossEAC 2012-02-29 09:41:06)

Star Wars - Rogue Squadron II - Rogue Leader

---------------------------------------------------------------------

5 ngc raw isos = 6.79 GB

5 ngc pakkiso'd = 6.13 GB = 0.66 GB saved over raw iso

5 ngc raw CDGroup'd = 2.40 GB = 3.73 GB saved over pakkiso

5 ngc compress'd CDGroup'd = 2.22 GB = 0.18 GB saved over raw CDGroup

---------------------------------------------------------------------

Star Wars - Rogue Squadron II - Rogue Leader (USA) md5 0BF391BEE90DA09D6042016D23E7A9B1

Star Wars - Rogue Squadron II - Rogue Leader (Europe) md5 15C49FF1DC836D19A1F275AC46EF4398

Star Wars - Rogue Squadron II - Rogue Leader (France) md5 776A1046F8C47D7AAD0CD732B53587CA

Star Wars - Rogue Squadron II - Rogue Leader (Spain) md5 BB7EEBBBB11FC8DC97F483A4DC3382CC

Star Wars - Rogue Squadron II - Rogue Leader (Germany) md5 FE32765CE6FCBB67FFD9E6F63D09CAE0


12 2012-02-29 08:00:58 (edited by tossEAC 2012-02-29 09:42:52)

Tales of Symphonia

---------------------------------------------------------------------

10 ngc raw isos = 13.5 GB

10 ngc pakkiso'd = 9.96 GB = 3.54 GB saved over raw iso

10 ngc raw CDGroup'd = 6.24 GB = 3.72 GB saved over pakkiso

10 ngc compress'd CDGroup'd = 5.25 GB = 0.99 GB saved over raw CDGroup

---------------------------------------------------------------------

Tales of Symphonia (France) (Disc 1) md5 057DD0466CFCD64957A880E212180D3B

Tales of Symphonia (USA) (Disc 2) md5 1078706DC88043BA6E7AB1A9AA708B32

Tales of Symphonia (France) (Disc 2) md5 83B3D38E342CD2A026B671908ED43574

Tales of Symphonia (Germany) (Disc 2) md5 97CE6F8EAB0272E7BBDC71DC96C6651A

Tales of Symphonia (Japan) (Disc 1) md5 A005A10A69A164F32D9D88AAE8013ECF

Tales of Symphonia (USA) (Disc 1) md5 A427C43797AB0AD5C024014142D1C3E1

Tales of Symphonia (Europe) (Disc 1) md5 A797655B74D8218D5F0C8229F393AB60

Tales of Symphonia (Europe) (Disc 2) md5 EE81AC0AFDCA6D3930CAD14370CBEA75

Tales of Symphonia (Japan) (Disc 2) md5 FB96C11A1B3BED38942DE37A5B6A72EA

Tales of Symphonia (Germany) (Disc 1) md5 FC4CAEAA8BAFBA662ABB400DE9C97579


13 2012-02-29 08:20:39 (edited by tossEAC 2012-02-29 23:11:04)

Tiger Woods PGA Tour 2004

---------------------------------------------------------------------

6 ngc raw isos = 8.14 GB

6 ngc pakkiso'd = 4.38 GB = 3.76 GB saved over raw iso

6 ngc raw CDGroup'd = 1.77 GB = 2.61 GB saved over pakkiso

6 ngc compress'd CDGroup'd = 1.17 GB = 0.6 GB saved over raw CDGroup

---------------------------------------------------------------------

Tiger Woods PGA Tour 2004 (Europe) (Disc 1) (v1.00) md5 2AE87A7103E8D24BD66C7908B455A863

Tiger Woods PGA Tour 2004 (USA) (Disc 2) md5 4DF313D253F54222944F382ECE74EF81

Tiger Woods PGA Tour 2004 (USA) (Disc 1) md5 81BA9B53904348A25DB2EE0AE46D3B6C

Tiger Woods PGA Tour 2004 (Europe) (Disc 2) (v1.01) md5 95FD974F151C4A48FD65E3F8C8158419

Tiger Woods PGA Tour 2004 (Europe) (Disc 2) (v1.00) md5 A53EE73D2A01BB136B5CC52703C8E228

Tiger Woods PGA Tour 2004 (Europe) (Disc 1) (v1.01) md5 DC8A8164E37CB0BAFD982A4D30946E8A


14 2012-02-29 08:48:52 (edited by tossEAC 2012-02-29 11:15:11)

Super Smash Bros. Melee

---------------------------------------------------------------------

5 ngc raw isos = 6.79 GB

5 ngc pakkiso'd = 4.63 GB = 2.16 GB saved over raw iso

5 ngc raw CDGroup'd = 1.99 GB = 2.64 GB saved over pakkiso

5 ngc compress'd CDGroup'd = 1.41 GB = 0.58 GB saved over raw CDGroup

---------------------------------------------------------------------

Super Smash Bros. Melee (USA) (v1.02) md5 0E63D4223B01D9ABA596259DC155A174

Dairantou Smash Brothers DX (Japan) (v1.00) md5 378BE81BB6C38FEBD847FC4B7F7DC36F

Super Smash Bros. Melee (USA) (En,Ja) (v1.00) md5 3A62F8D10FD210D4928AD37E3816E33C

Super Smash Bros. Melee (USA) (v1.01) md5 67136BD167B471E0AD72E98D10CF4356

Dairantou Smash Brothers DX (Japan) (v1.02) md5 DC07ABD4B6A5E1517DA575274CEEFCF8


15 2012-02-29 12:23:16

Resident Evil

---------------------------------------------------------------------

6 ngc raw isos = 8.15 GB

6 ngc pakkiso'd = 4.79 GB = 3.36 GB saved over raw iso

6 ngc raw CDGroup'd = 4.85 GB = nothing saved over pakkiso

6 ngc compress'd CDGroup'd = 3.59 GB = 1.26 GB saved over raw CDGroup (1.20 GB saved over pakkiso)

---------------------------------------------------------------------

Biohazard (Japan) (Disc 1) md5 20CB8D4CB322AA503D1B8A49C43CDEBF

Resident Evil (Europe) (En,Fr,De,Es,It) (Disc 2) md5 457944F833FC2F5E8FF394CFDF2E1B7C

Resident Evil (USA) (Disc 2) md5 7DEFD099E98944BC93684D4733BFE68B

Resident Evil (USA) (Disc 1) md5 BDD0FE3848C4AB1441DC6C9EE209426B

Biohazard (Japan) (Disc 2) md5 BFBF8E0F249CF8DD8FCB913793301A8C

Resident Evil (Europe) (En,Fr,De,Es,It) (Disc 1) md5 C581FAB5FD10F55B76188E86194199C1


16 2012-02-29 23:12:22

TimeSplitters - Future Perfect

---------------------------------------------------------------------

2 ngc raw isos = 2.71 GB

2 ngc pakkiso'd = 1.63 GB = 1.08 GB saved over raw iso

2 ngc raw CDGroup'd = 1.37 GB = 0.26 GB saved over pakkiso

2 ngc compress'd CDGroup'd = 868 MB = 534 MB saved over raw CDGroup

---------------------------------------------------------------------

TimeSplitters - Future Perfect (Europe) md5 8743DFA398DE3E3448FEBE87D75D626C

TimeSplitters - Future Perfect (USA) md5 AF8710863BAD728DE9A26231D398E99B


17 2012-03-01 03:00:55

Mario Party 4

---------------------------------------------------------------------

4 ngc raw isos = 5.43 GB

4 ngc pakkiso'd = 4.37 GB = 1.06 GB saved over raw iso

4 ngc raw CDGroup'd = 2.71 GB = 1.66 GB saved over pakkiso

4 ngc compress'd CDGroup'd = 2.19 GB = 0.52 GB saved over raw CDGroup

---------------------------------------------------------------------

Mario Party 4 (USA) (v1.01) md5 01DE13AD52F1554975E7F316370BA086

Mario Party 4 (Europe) (En,Fr,De,Es,It) (v1.00) md5 5DC6F6949F09DA56FD0313CBDB85FCA5

Mario Party 4 (USA) (v1.00) md5 6EDA0E31DCF1622EF749C5CE8F1196A3

Mario Party 4 (Europe) (En,Fr,De,Es,It) (v1.02) md5 F6740FC00D9818AD1F161889382C7902


18 2012-03-01 10:26:19 (edited by gaijin 2012-03-01 14:28:11)

For me, CDGroup gives only a small gain after compression. I use raw images (or ecm for CDs) + FreeArc (better solid compression than 7z). CDGroup is only practical for personal storage, not for sharing or quick access to images.

TEST: Lost Kingdoms II (USA+Europe+Japan) (1.36 + 1.36 + 1.36 GB)

Solid 7z raw isos = Lost Kingdoms II (USA+Europe+Japan).7z  -  2678068917 (2.49 GB)

CDGroup + packed 7z = Lost Kingdoms II (USA+Europe+Japan).7z - 2676988096 (2.49 GB)

FreeArc raw isos = Lost Kingdoms II (USA+Europe+Japan).arc     - 1685900893 (1.56 GB)

CDGroup + packed FreeArc = Lost Kingdoms II (USA+Europe+Japan).arc  -  1706886824 (1.58 GB)


If you make patches USA->Europe->Japan, the result will be even smaller (maniac version :))

19 2012-03-01 19:08:00

New version CDGroup v0.3.2

Differences from v0.3.1

  • Old code cleanup. Some options (like deletefiles) removed

  • Improve cross-platform compatibility

  • Remove J7zip as a 7z decompressor (can't handle some 7z files, presumably solid archives)

  • External compressor / decompressor support (t7z.exe currently in there)

Note: You can define your own external compressor / decompressor using options.ini

CDGroup v0.3.2


20 2012-03-02 09:15:19 (edited by gaijin 2012-03-02 09:19:30)

I tried using a larger number of passes and got this error:

http://iceimg.com/t/4f/3b/42f118991c.jpg

21 2012-03-02 14:31:09

Looks like you ran out of ram because the piece size was too small for the size of the largest file being grouped. If you didn't actually run out of ram (just the ram java had access to), try executing with the -Xmx flag like so:
java -Xmx1000m -jar CDGroup.jar ... (to give the program 1000MB of ram to work with for example).

Note this is partly why multiple passes are not recommended, nor is an initial pass of anything other than 1. They eat ram when given a large enough file and / or small enough piece size. The standard settings should allow most people to merge most common file sizes (given 512MiB of ram it should easily handle full dual layer DVD sized images). Multiple passes will probably be removed. Initial pass may be kept in for experimentation.

How multiple passes works (and why you shouldn't use it)

The binary result from one pass is the input for the next. The only benefit (compared to changing the initial pass and keeping passes at 1) is decreased overhead (the size of the combined gsm files will be smaller). But multiple passes has disadvantages that far outweigh it:

  • After each pass, sectors are more 'mixed' than they were before

  • Maximum concurrent ram usage doubles wrt the biggest file after each pass

  • The same data is being grouped again and again in a nested fashion

The concurrent ram usage is where there's a real problem. Maximum concurrent ram usage in bytes is estimated as the number of pieces in the biggest file being merged multiplied by 48 (plus some overhead). With normal settings (sector matching), a 1GiB file would have 524288 pieces, taking up roughly 25MiB of ram (non-contiguous). This is fine for any sensible image size.

Taking 2048 as an example, each successive pass halves the piece size, so the number of pieces the file is split into doubles. The program ran out of memory at pass '256', where maximum concurrent ram usage would be roughly 8*25 = 200MiB per GiB of the maximum file size. Using passes instead of initialpass makes this worse, as a large file could have been made earlier, increasing concurrent ram usage further.
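
The estimate above can be worked through in a few lines. This is only the arithmetic from this post (pieces multiplied by 48 bytes, overhead ignored); the function name estimated_ram_bytes is made up for the example. The exact figures come out at 24MiB and 192MiB per GiB, in line with the rounded 25 and 200 quoted above.

```python
BYTES_PER_PIECE = 48   # figure quoted in this post; extra overhead is ignored
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimated_ram_bytes(file_size: int, piece_size: int) -> int:
    """Number of pieces in the file multiplied by the per-piece cost."""
    pieces = -(-file_size // piece_size)   # ceiling division
    return pieces * BYTES_PER_PIECE


for piece_size in (2048, 1024, 512, 256):
    ram = estimated_ram_bytes(GIB, piece_size)
    print(f"{piece_size:>4} bytes/piece -> {ram / MIB:.0f} MiB per GiB")

# 2048 bytes/piece -> 24 MiB per GiB
# 1024 bytes/piece -> 48 MiB per GiB
#  512 bytes/piece -> 96 MiB per GiB
#  256 bytes/piece -> 192 MiB per GiB
```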

Freearc testing

As for your freearc testing, the only reason I can see for the result is that the repeated sectors are pulled out of where they are and placed in gs0. This may mess with a compressor as it could put unrelated sectors close together (although the sectors are ordered as naturally as possible within the gs0). Something else to consider is that as the images are small, freearc may have been able to see matching sectors from two images next to each other. It seems unlikely that the freearc performance would scale to bigger images (like full DVD images for example).

There are some things I can see to improve compressibility (or at least the chances of improved compressibility):

  • Instead of pulling repeated sectors out to gs0, remove them from everywhere except the first location they are present

  • Swap the name with the extension in the resulting files. I know freearc can order files any which way, but t7z orders them by extension, which is possibly the worst way to order them with the files named as they currently are (this would only affect merges with CDs with multiple sector types, which currently are compressed in the order 2048.gs0, 2324.gs0, 2048.gs1, 2324.gs1 ...)

  • Store gsm file in a different manner (gsm is the overhead when merging). There's a way to store the gsm such that it is the same size and does the same thing, but is more compressible

If the first is implemented it seems very unlikely that freearc will produce a smaller file from raw images than the merged images.

Anything else that anyone thinks can improve compressibility let me know here or in pm.


22 2012-03-03 11:40:03 (edited by gaijin 2012-03-03 11:45:54)

I used the -Xms flag; -Xmx doesn't work for me.

And a little more tests:

Crisis Core - Final Fantasy VII (EN+US+US+DE+IT+ES+FR+JP+JP) (PSP) (14.6 GB)

FreeArc raw 9 isos = around 7-8 GB (maybe solid compression doesn't work at large sizes)

CDGroup + packed  FreeArc = 2.19 GB

maniac version - EU -> convert patches + packed FreeArc -> US+US+DE+IT+ES+FR+JP+JP = 2.16 GB (I made them earlier)

FreeArc totally lost :)

Resident Evil (USA+Europe+Japan) (GameCube) (8.15 GB)

FreeArc raw 6 isos = 3 GB

CDGroup + packed  FreeArc = 2.06 GB


One request: if possible, add the ability to extract a single needed image instead of unmerging everything.

23 2012-03-03 19:58:40

It's possible to implement extraction of a single image. However the current direction I'm heading in making the data more compressible means more of the files are needed to extract a single image (the first bullet point on the todo list above). This makes single image extraction fine for uncompressed files, but if they're compressed it's either messy (don't delete extracted working files) or wasteful (extracting multiple single images, deleting the working files each time means multiple decompression of the same file).

Something I'm toying with is having an external diff stage after grouping and before external compression. This could really reduce the benefit of single image extraction (if it takes 90% of the effort to extract a single image, you might as well go the last 10% to get the rest).


24 2012-03-09 10:17:24

New in v0.4.2:

  • New merged format

  • New stage (external diff)

  • Eliminated ram usage dependency on maximum image size

  • Multiple passes removed when grouping

  • Ram limit removed when ungrouping (obsolete)

The new format aims to be more compressible. The differences from the old format are:

  • All extensions are now .h* instead of .g* (for clarity)

  • There is no longer a repeated sector store (gs0). Instead, a repeated sector is stored in the first location it is present within the image stores (hs1, hs2 ...) (keep related data together)

  • hsm files are stored differently (more compressible)

  • Sector storage extension and name is swapped (may help some external compressors order sensibly by extension)
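
As a loose illustration of the second point (no separate repeated sector store), the sketch below keeps each sector in the store of the first image it appears in, and later occurrences just point back to it. The function names and the in-memory layout are assumptions for the example and say nothing about how the hs* and hsm files are actually encoded.

```python
def group_first_occurrence(images, sector_size=2048):
    """Keep every sector in the store of the first image that contains it."""
    seen = {}     # sector bytes -> (store number, index within that store)
    stores = []   # one list of sectors per input image, in input order
    maps = []     # per image: list of (store number, index) references
    for data in images:
        store, refs = [], []
        for off in range(0, len(data), sector_size):
            sector = data[off:off + sector_size]
            if sector not in seen:
                # First time this sector is seen: it lives in this image's store.
                seen[sector] = (len(stores), len(store))
                store.append(sector)
            refs.append(seen[sector])
        stores.append(store)
        maps.append(refs)
    return stores, maps


def rebuild_image(stores, refs) -> bytes:
    return b"".join(stores[s][i] for s, i in refs)


Z, A, B = bytes(2048), b"A" * 2048, b"B" * 2048
images = [Z + Z + A, Z + A + B]
stores, maps = group_first_occurrence(images)
print([len(s) for s in stores], maps)
```

The second image only contributes the one sector that was not already present in the first, so related data stays together in the earlier store.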

The new external diff stage is to remove repeated data that is not aligned to sector boundaries (shifted data for example). For some inputs this stage plays a good role in crunching down the file size, in others the previous grouping stage did most of the leg work.
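
A small demonstration of why sector-aligned grouping cannot catch shifted data, assuming 2048-byte pieces and a made-up helper aligned_matches. The same payload pushed along by one byte shares no aligned piece with the original, even though a byte-level search (which is the job of a diff tool) finds it immediately.

```python
import random


def aligned_matches(a: bytes, b: bytes, piece_size: int = 2048) -> int:
    """Count the aligned pieces of b that also occur as aligned pieces of a."""
    pieces_a = {a[i:i + piece_size] for i in range(0, len(a), piece_size)}
    return sum(
        1
        for i in range(0, len(b), piece_size)
        if b[i:i + piece_size] in pieces_a
    )


random.seed(0)   # fixed seed so the example is repeatable
payload = bytes(random.getrandbits(8) for _ in range(8 * 2048))
shifted = b"\x00" + payload   # identical content, offset by a single byte

print(aligned_matches(payload, payload))   # every aligned piece matches
print(aligned_matches(payload, shifted))   # the shift breaks alignment
print(shifted.find(payload))               # a byte-level search still finds it
```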

CDGroup v0.4.2


25 2012-03-09 12:49:05

Will test soon, on something from above to see how it compares to the old format.
