If you've never archived files off-line (& don't suffer from OCD), then you'll probably never have needed to find the subset of them which best fits the available space. Let's say you've got a collection of files which you can no longer afford to store on your computer's HDD, or which you want to back up (as one tends to after suffering one's first catastrophic disk-failure). The backup-medium could be a CD, DVD, or flash-drive; it makes no difference: the available space is finite & of well-defined size. Adding up the total space-requirement of the files, & dividing it by the space available on a DVD, tells you the minimum number of DVDs required if no space is wasted; but when you try to group the files into subsets (each of which must fit within the space available on one DVD, without wasting any), you find it unexpectedly laborious.
If all the files were the same size, then the number of files in each subset would be known1, & the choice of files in any subset would only be relevant for reasons outside the domain of this problem, since this choice doesn't affect the space which is wasted. The fact that they're of different sizes makes the choice of files in each subset the determining factor in the amount of space wasted when copied to the off-line storage-medium.
Though one could gzip the files to reduce the required space, that merely changes the parameters of the problem, rather than side-stepping it. Since the size of any subset is independent of the order in which the files are listed, we're not interested in the permutations of files into subsets for archiving; just the combinations.
If you had just two files, there'd be only three possible subsets, assuming that Ø isn't a useful solution.
If you had just three files a, b & c, there'd be seven possible subsets; a, b, c, ab, ac, bc, abc.
If you had n files, the pattern is clear: each file must be either inside or outside a given subset, so with just these two possibilities for each file, the total number of subsets is 2ⁿ (i.e. exponential), of which 2ⁿ - 1 are interesting2.
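This count is easy to verify with Haskell's standard subsequences function (a throw-away sketch, unrelated to squeeze's internals):

```haskell
import Data.List (subsequences)

-- Each file is either inside or outside a subset, so n files yield 2^n
-- subsets, of which 2^n - 1 are non-empty.
countSubsets :: [a] -> (Int, Int)
countSubsets files = (length allSubsets, length (filter (not . null) allSubsets))
  where
    allSubsets = subsequences files  -- includes the empty subset Ø

main :: IO ()
main = print (countSubsets "abc")  -- (8,7): matches the three-file example above
```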
This is one of a set of problems with O(cⁿ) time-complexity. Such problems are computationally intractable, since the time required to solve them explodes as the size of the input-data increases. Having a computer (or brain) which is c times as fast only allows one to solve, in a given time-period, a problem with just one more input datum. The solution-time grows like the apocryphal grains of wheat on a chess-board; or in this case, with the number of files to store.
This problematic time-complexity has a corresponding advantage: with so many file-combinations from which to select one's solution, the likelihood of a good fit can be very high, & with a favourable distribution of file-sizes relative to the available space, exact matches are not uncommon … you might just have to wait until the heat-death of the universe4 before one is found.
If one doesn't take care to dispose of sub-optimal solutions as soon as possible, retaining only those of significance, then one may stumble over the potential for O(2ⁿ) space-complexity, & the resulting insatiable appetite for RAM will quickly cripple one's computer.
The exponential time-complexity of the problem makes an exhaustive search of possible solutions tractable only when the number of files is small. The algorithm coded in squeeze deals with it by a variety of strategies:
Initially, it removes any file which is individually larger than the available storage-space (& which therefore can't be part of any solution), & checks that the aggregate size of all files is sufficient for any acceptable solution to be possible.
Once the available space has been filled by any one subset of files, all the combinations involving the remaining files (for which zero space now remains) become irrelevant; this can significantly reduce the effective time-complexity of the problem.
It attempts to take a reasonably intelligent route through the solution-space. Just as if you were packing a removal-van, it starts with the large files, leaving smaller ones to wedge into the awkward gaps which remain, as required for a good solution.
It presents solutions of increasing suitability as they're found, rather than exhaustively searching the entire solution-space, then presenting the optimum. This allows sub-optimal results to be used, if they're deemed adequate, without wasting time on diminishing returns.
It frees the heap associated with solutions as soon as they're known to be sub-optimal, allowing the application to function in O(1) space.
It reduces the number of file-combinations by allowing one to specify atomic groups of files. Say one had a collection of video files to store on DVD: a series composed of a sequence of episodes, & a collection of unrelated films. For convenience, one might reasonably want to keep the episodes of the series together on one DVD, then fill the gaps with any combination of films. By treating the whole TV-series as one atomic group, one can reduce the exponent of the problem's time-complexity by (episodes - 1).
If more than one CPU-core is available, then solutions can be investigated in parallel (by supplying the
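These strategies can be sketched in miniature as follows; this is my own illustrative code, with hypothetical names like fits & improving, & is not squeeze's actual implementation:

```haskell
import Data.List (sortBy)
import Data.Ord  (Down (..), comparing)

type Size = Integer

-- Enumerate the subsets whose total size fits within the capacity, pruning
-- any file which would overflow the space remaining on this branch.  Files
-- are offered largest-first, so the big items are placed early & the small
-- ones wedge into the gaps which remain.
fits :: Size -> [Size] -> [(Size, [Size])]
fits capacity = go 0 [] . sortBy (comparing Down)
  where
    go total chosen []     = [(total, reverse chosen)]
    go total chosen (f:fs)
      | total + f > capacity = go total chosen fs               -- prune: f can't fit
      | otherwise            = go (total + f) (f : chosen) fs   -- include f
                            ++ go total chosen fs               -- exclude f

-- Stream only the solutions of strictly increasing suitability, discarding
-- each one as soon as it's known to be sub-optimal.
improving :: Size -> [Size] -> [(Size, [Size])]
improving capacity = go 0 . fits capacity
  where
    go _    []                      = []
    go best ((total, files) : rest)
      | total > best = (total, files) : go total rest
      | otherwise    = go best rest

main :: IO ()
main = print (improving 10 [2,3,4,5,8])  -- [(10,[8,2])]
```

Because the list of improving solutions is produced lazily, each better fit can be printed as soon as it's found, mirroring squeeze's behaviour of presenting solutions of increasing suitability rather than only the final optimum.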
ls -l */*
-rw-r--r-- 1 al users 1370849940 2010-02-11 01:14 Deadwood_1/Deadwood_1_06_Plague.m2t
-rw-r--r-- 1 al users 1579066144 2010-02-18 01:14 Deadwood_1/Deadwood_1_07_BullockReturnsToCamp.m2t
-rw-r--r-- 1 al users 1628667500 2010-02-25 01:19 Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t
-rw-r--r-- 1 al users 1475753752 2010-03-04 01:24 Deadwood_1/Deadwood_1_09_NoOtherSonsOrDaughters.m2t
-rw-r--r-- 1 al users 1455273972 2010-03-11 01:19 Deadwood_1/Deadwood_1_10_MisterWu.m2t
-rw-r--r-- 1 al users 1536891916 2010-03-18 02:24 Deadwood_1/Deadwood_1_11_JewelsBootIsMadeForWalking.m2t
-rw-r--r-- 1 al users 1506550596 2010-03-25 01:24 Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t
-rw-r--r-- 1 al users 1316473008 2009-12-23 01:14 Deadwood_3/Deadwood_3_02_IAmNotTheFineManYouTakeMeFor.m2t
-rw-r--r-- 1 al users 1354711456 2009-12-24 01:14 Deadwood_3/Deadwood_3_03_TrueColors.m2t
-rw-r--r-- 1 al users 1317241928 2009-12-25 01:14 Deadwood_3/Deadwood_3_04_FullFaithAndCredit.m2t
-rw-r--r-- 1 al users 1339957592 2009-12-26 01:44 Deadwood_3/Deadwood_3_05_ATwoHeadedBeast.m2t
-rw-r--r-- 1 al users 1330129328 2009-12-27 01:44 Deadwood_3/Deadwood_3_06_ARichFind.m2t
-rw-r--r-- 1 al users 1134503672 2009-12-28 01:09 Deadwood_3/Deadwood_3_07_UnauthorizedCinnamon.m2t
-rw-r--r-- 1 al users 1279459944 2009-12-29 01:14 Deadwood_3/Deadwood_3_08_LeviathanSmiles.m2t
-rw-r--r-- 1 al users 1258818672 2009-12-30 01:09 Deadwood_3/Deadwood_3_09_AmateurNight.m2t
-rw-r--r-- 1 al users 1175977224 2009-12-31 01:09 Deadwood_3/Deadwood_3_10_AConstantThrob.m2t
-rw-r--r-- 1 al users 1364831120 2010-01-01 01:14 Deadwood_3/Deadwood_3_11_TheCatBirdSeat.m2t
-rw-r--r-- 1 al users 1260693784 2010-01-02 01:09 Deadwood_3/Deadwood_3_12_TellHimSomethingPretty.m2t
-rw-r--r-- 1 al users 3525287076 2010-06-06 00:54 Films/BatmanBegins.m2t
-rw-r--r-- 1 al users 1736242604 2010-06-15 00:59 Films/OfficeSpace.m2t
-rw-r--r-- 1 al users 2086362148 2010-07-18 23:24 Films/TheAmericanPresident.m2t
-rw-r--r-- 1 al users 2631969732 2010-07-27 00:39 Films/TheImportanceOfBeingEarnest.m2t
-rw-r--r-- 1 al users 2112377964 2010-07-10 00:19 Films/ThePaper.m2t
-rw-r--r-- 1 al users 914662864 2010-03-01 20:04 LeadBalloon_1/LeadBalloon_1_02_Wayne.m2t
-rw-r--r-- 1 al users 831265500 2010-03-15 20:04 LeadBalloon_1/LeadBalloon_1_03_5000Pounds.m2t
-rw-r--r-- 1 al users 795453568 2010-03-29 20:04 LeadBalloon_1/LeadBalloon_1_04_Allergic.m2t
-rw-r--r-- 1 al users 914268816 2010-04-19 03:04 LeadBalloon_1/LeadBalloon_1_05_Pistachio.m2t
-rw-r--r-- 1 al users 1113087652 2010-04-20 03:04 LeadBalloon_1/LeadBalloon_1_06_Fatty.m2t
-rw-r--r-- 1 al users 724333356 2010-07-26 03:04 LeadBalloon_2/LeadBalloon_2_06_Debacle.m2t
-rw-r--r-- 1 al users 605369024 2010-07-22 02:49 LeadBalloon_2/LeadBalloon_2_07_Rita.m2t
Suppose one wants to find the subset of them which fits into the 4.7GB available on a DVD, with minimal waste. The total space-requirement is ten disks5, which represents the barrier beneath which no packing-solution can go, & which the optimal solution can only approach.
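The arithmetic behind that lower bound is just a rounded-up division; a sketch using the figures from footnote 5:

```haskell
-- The minimum number of discs is the total size of all files divided by the
-- capacity of one disc, rounded up.
minDiscs :: Integer -> Integer -> Integer
minDiscs capacity total = (total + capacity - 1) `div` capacity

main :: IO ()
main = print (minDiscs 4700000000 42676531852)  -- 10: 9.08 DVDs, rounded up
```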
If one just filled DVDs, arbitrarily using the order in which the files are listed above, then the sum of the wasted tail-ends would be equivalent to nearly two disks.
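A naive first-fit pass of this kind is easily simulated; the sketch below uses small hypothetical sizes rather than the real listing above, & the names are my own:

```haskell
type Size = Integer

-- Fill discs greedily in listing order, starting a new disc whenever the
-- next file won't fit on the current one, then sum the wasted tail-ends.
firstFitWaste :: Size -> [Size] -> (Int, Size)
firstFitWaste capacity files = (length discs, sum (map (capacity -) discs))
  where
    discs = foldl place [] files
    place [] f = [f]
    place (current : rest) f
      | current + f <= capacity = current + f : rest  -- fits on the current disc
      | otherwise               = f : current : rest  -- start a new disc

main :: IO ()
main = print (firstFitWaste 10 [6,5,4,3,2])  -- (3,10): three discs, ten units wasted
```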
[Table: the video files allocated to each DVD, & the bytes wasted per DVD; Total Bytes Wasted: 9,023,468,148]
If you were expecting some bloated GUI, then prepare yourself for disappointment; "squeeze" is deliberately just a command-line utility, to better integrate with other such utilities, & because a graphical interface, besides looking good, adds no obvious value.
cabal install squeeze
Resolving dependencies…
Configuring squeeze-220.127.116.11…
Building squeeze-18.104.22.168…
Preprocessing library squeeze-22.214.171.124…
[1 of 5] Compiling Squeeze.Control.Concurrent.DivideAndConquer ( src-lib/Squeeze/Control/Concurrent/DivideAndConquer.hs, dist/build/Squeeze/Control/Concurrent/DivideAndConquer.o )
[2 of 5] Compiling Squeeze.Data.File ( src-lib/Squeeze/Data/File.hs, dist/build/Squeeze/Data/File.o )
[3 of 5] Compiling Squeeze.Data.FileCombination ( src-lib/Squeeze/Data/FileCombination.hs, dist/build/Squeeze/Data/FileCombination.o )
[4 of 5] Compiling Squeeze.Data.CommandOptions ( src-lib/Squeeze/Data/CommandOptions.hs, dist/build/Squeeze/Data/CommandOptions.o )
[5 of 5] Compiling Squeeze.Squeeze ( src-lib/Squeeze/Squeeze.hs, dist/build/Squeeze/Squeeze.o )
In-place registering squeeze-126.96.36.199…
Preprocessing executable 'squeeze' for squeeze-188.8.131.52…
[1 of 3] Compiling Squeeze.Test.Performance ( src-exe/Squeeze/Test/Performance.hs, dist/build/squeeze/squeeze-tmp/Squeeze/Test/Performance.o )
[2 of 3] Compiling Paths_squeeze ( dist/build/autogen/Paths_squeeze.hs, dist/build/squeeze/squeeze-tmp/Paths_squeeze.o )
[3 of 3] Compiling Main ( src-exe/Main.hs, dist/build/squeeze/squeeze-tmp/Main.o )
Linking dist/build/squeeze/squeeze …
Installing library in /home/al/.cabal/lib/squeeze-184.108.40.206/ghc-7.6.3
Installing executable(s) in /home/al/.cabal/bin
Registering squeeze-220.127.116.11 …
Installed squeeze-18.104.22.168
$
~/.cabal/bin/squeeze --maximumBytes=4700000000 */*.m2t
4672110012 ["Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t","Deadwood_1/Deadwood_1_11_JewelsBootIsMadeForWalking.m2t","Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t"]
4683487396 ["Deadwood_1/Deadwood_1_07_BullockReturnsToCamp.m2t","Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t","Deadwood_1/Deadwood_1_09_NoOtherSonsOrDaughters.m2t"]
4698067172 ["Deadwood_1/Deadwood_1_10_MisterWu.m2t","Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t","Films/OfficeSpace.m2t"]
4699892276 ["Deadwood_3/Deadwood_3_03_TrueColors.m2t","Deadwood_3/Deadwood_3_09_AmateurNight.m2t","Films/TheAmericanPresident.m2t"]
The fact that each of the solutions presented contains exactly three files is merely a coincidence, promoted by the statistical likelihood of this type of solution given the average file-size relative to the available storage-space; other possibilities were tested, but none made the grade.
Grouping Related Files
The solution found so far is adequate, but one may want to keep the episodes of a series together in one archive. In their current file-format, this is unlikely to work; the files are too large to fit onto one DVD in any order. The file-size can be dramatically reduced by snipping out the advert-breaks using dvbcut, & then using ffmpeg & x264 to convert the video-stream to the more efficient H.264 format; in this case wrapped, with the original audio-stream, inside an MKV-container. For the purposes of this example, I'll just edit & re-encode a single directory of video files, corresponding to the episodes of one season of a series.
ls -l 'Deadwood_3/'
-rw-r--r-- 1 al users 315054028 2010-01-05 08:57 Deadwood_3_02_IAmNotTheFineManYouTakeMeFor.mkv
-rw-r--r-- 1 al users 287254198 2010-01-05 08:57 Deadwood_3_03_TrueColors.mkv
-rw-r--r-- 1 al users 288151700 2010-01-05 08:57 Deadwood_3_04_FullFaithAndCredit.mkv
-rw-r--r-- 1 al users 340534828 2010-01-05 08:57 Deadwood_3_05_ATwoHeadedBeast.mkv
-rw-r--r-- 1 al users 303210766 2010-01-05 08:57 Deadwood_3_06_ARichFind.mkv
-rw-r--r-- 1 al users 186886160 2010-01-05 08:58 Deadwood_3_07_UnauthorizedCinnamon.mkv
-rw-r--r-- 1 al users 263138794 2010-01-05 08:58 Deadwood_3_08_LeviathanSmiles.mkv
-rw-r--r-- 1 al users 324496358 2010-01-05 08:58 Deadwood_3_09_AmateurNight.mkv
-rw-r--r-- 1 al users 228694008 2010-01-05 08:58 Deadwood_3_10_AConstantThrob.mkv
-rw-r--r-- 1 al users 313909470 2010-01-05 08:58 Deadwood_3_11_TheCatBirdSeat.mkv
-rw-r--r-- 1 al users 285970588 2010-01-05 08:58 Deadwood_3_12_TellHimSomethingPretty.mkv
This directory is just 22%6 of the original size, which now represents 67% of the space available on a DVD, leaving space into which a subset of the remaining files can fit.
One can now re-run the test, requiring that any solution either includes this directory of associated files in its entirety, or doesn't include any member of it at all, to see whether it can now be packed acceptably with the remaining uncompressed MPEG-2 files. This requirement is communicated to the executable "squeeze" merely by specifying the directory-name, rather than, as before, the individual file-names it contains.
time squeeze 'Deadwood_3/' */*.m2t  # NB: by specifying a directory, we've implicitly requested atomicity.
4674192814 ["Deadwood_3/","Deadwood_1/Deadwood_1_11_JewelsBootIsMadeForWalking.m2t"]
4683487396 ["Deadwood_1/Deadwood_1_07_BullockReturnsToCamp.m2t","Deadwood_1/Deadwood_1_08_SufferTheLittleChildren.m2t","Deadwood_1/Deadwood_1_09_NoOtherSonsOrDaughters.m2t"]
4698067172 ["Deadwood_1/Deadwood_1_10_MisterWu.m2t","Deadwood_1/Deadwood_1_12_SoldUnderSin.m2t","Films/OfficeSpace.m2t"]
This time, I didn't explicitly specify the available space, but rather relied on a hard-coded default-value.
An acceptable (though sub-optimal) solution was found, which includes the entire re-encoded directory, & which wastes only 0.55% of the available space. As a side-effect of the request to treat eleven files as one atomic directory, the total number of file-combinations has been reduced by a factor of 2¹⁰ (i.e. 1024), & in consequence, the time required for an exhaustive search is reduced proportionately.
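That reduction can be sanity-checked with trivial arithmetic (a sketch; the names are mine):

```haskell
-- With n individual files there are 2^n subsets to consider; merging k of
-- them into one atomic group replaces 2^k per-file choices with just 2,
-- a reduction by a factor of 2^(k - 1).
searchSpace :: Integer -> Integer
searchSpace n = 2 ^ n

reduction :: Integer -> Integer
reduction k = 2 ^ (k - 1)

main :: IO ()
main = print (reduction 11)  -- 1024: the eleven-episode directory above
```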
My usual test-platform was used, to gather data for these graphs. Though from version 22.214.171.124, the application is multi-threaded, it was run using only a single thread. The measured execution-time is the time required to find the optimal solution; an acceptable solution is typically returned much more quickly.
The tests were performed by asking the application to artificially generate the file-data it normally reads from the filesystem, which permits testing with an arbitrary number of files, without requiring them to actually exist. The frequency-distribution observed using fishfood, for the size of Matroska video-files in a real file-system (mine), has a mean value of about a twelfth of the space on a DVD, & a standard-deviation approximately equal to the mean file-size. "A Large-Scale Study of File-System Contents", by John R. Douceur & William J. Bolosky, concludes that for arbitrary file-types, file-size matches a log-normal distribution.
The execution-time has approximately exponential O(1.7ⁿ) time-complexity with respect to the number of files. The exponential curve appears straight, since the ordinate is plotted on a logarithmic scale.
The 3-D plot has contours of constant execution-time, projected onto the base. These contours are separated by a constant time-ratio, rather than a constant absolute difference, in accordance with the logarithmic scale.
The second graph illustrates the effect on the execution-time, of the number of files (as before), & also the ratio of the total available space to the mean file-size.
Finding the optimal solution is only feasible with a small number of files or directories; though a sub-optimal solution is typically adequate & can be found much more quickly.
For a given mean file-size, the required execution-time is highly dependent on the file-size distribution. Though not depicted above, changing the assumption of a log-normal distribution of file-sizes to:
- (Space available ÷ file-size), rounded-down.
- The 0-1 Knapsack-problem assigns a value to each of the items which can become part of the solution, whereas this problem values all files equally.
- 2038 on UNIX or GNU/Linux.
- 42,676,531,852 bytes, i.e. 9.08011316 DVDs, which naturally must be rounded-up.
- (3,137,300,898 ÷ 14,132,797,728) bytes.