Data file formats

This section could well be renamed "the virtues of binary IO" since that's what will be preached here. For concreteness, let's consider the following segments out of a Fortran program:

      parameter(n=10000000)
      real*8 c(n)

! Text from loop:
      open(unit=9,file='loop_text.dat',form='formatted')
      do i = 1,n
         write(9,*) c(i)
      enddo
      close(9)

! Block text:
      open(unit=10,file='block_text.dat',form='formatted')
      write(10,*) c
      close(10)

! Binary:
      open(unit=11,file='binary.dat',form='unformatted')
      write(11) c
      close(11)

IO speed
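To compare the three methods on your own system, each segment can be bracketed with timer calls. What follows is only a minimal sketch, not the program used for the timings quoted in this section: the program name time_io, the helper routines start_timer and report, and the much smaller array size are all assumptions made for illustration. It relies on the standard cpu_time and system_clock intrinsics.

```fortran
program time_io
  ! Sketch: time the three write styles on a small array.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 200000      ! assumed size, far smaller than in the text
  real(dp), allocatable :: c(:)
  real(dp) :: cpu0, cpu1
  integer :: i, t0, t1, rate

  allocate(c(n))
  call random_number(c)

  ! Text from loop: one formatted write per element
  call start_timer()
  open(unit=9, file='loop_text.dat', form='formatted', status='replace')
  do i = 1, n
     write(9,*) c(i)
  end do
  close(9)
  call report('Text from loop')

  ! Block text: one formatted write for the whole array
  call start_timer()
  open(unit=10, file='block_text.dat', form='formatted', status='replace')
  write(10,*) c
  close(10)
  call report('Block text')

  ! Binary: one unformatted write for the whole array
  call start_timer()
  open(unit=11, file='binary.dat', form='unformatted', status='replace')
  write(11) c
  close(11)
  call report('Binary')

contains

  ! Record the CPU time and wall-clock count at the start of a segment.
  subroutine start_timer()
    call cpu_time(cpu0)
    call system_clock(t0, rate)
  end subroutine start_timer

  ! Print CPU and elapsed seconds since the last start_timer call.
  subroutine report(label)
    character(len=*), intent(in) :: label
    call cpu_time(cpu1)
    call system_clock(t1)
    print '(a16,2(a,f9.3))', label, '  cputime (s):', cpu1 - cpu0, &
          '  elapsed (s):', real(t1 - t0, dp) / real(rate, dp)
  end subroutine report

end program time_io
```

With an array this small the absolute numbers mean little, and file caching can distort the elapsed times, so treat the output as a rough ratio between the methods only.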
The timings (in mm:ss) for the above IO on the VPP to appropriate disks and filesystems are:

                        /short                 /vflvol
IO method          cputime    elapsed     cputime    elapsed
Text from loop     28:06.9    35:36.6     27:57.1    28:41.0
Block text         16:31.7    17:08.7     16:21.5    16:53.3
Binary                 4.6       17.3         0.4        6.9

Even on an inappropriate disk, binary IO is 50 times faster than formatted IO, and it is more than 300 times faster to fast disks. Even if you don't have such large arrays to write out, you can still expect orders-of-magnitude improvements by using binary IO.

The writing of binary data is basically a bit dump from memory to disk. Apart from orchestrating addresses and device offsets to get things started, and possibly some cleanup, the CPU has virtually no role. In contrast, writing ascii data involves a considerable amount of bit rearrangement and logical operations to convert from binary format. Eradicating the loop by using the single formatted IO statement only removes the overhead of so many IO calls; this still leaves 60% of the cputime.

Disk use
Since you are unlikely to want to read 10,000,000 numbers by eye, the major difference between loop_text.dat and binary.dat is about a factor of three to four in file size!

In a text (or ascii) file, each character requires 1 byte (8 bits) of storage. A free format double precision number is probably of the form
-0.1234567890123456d-123,
a total of 24 characters. On many systems, blank characters are added to each line, each taking 1 byte of storage, as well as an end-of-record marker. So it is easy to see how an 8-byte number can take up nearly four times as much space on a disk. And this is the good case: there can be up to 16 digits for a 4-byte real variable in free format!

In binary.dat, each number is exactly the 64 bits that it was in memory. Fortran includes a leading and trailing 4 or 8 bytes on every record in a sequential access file (as record delimiters and providing the record length), but there is only one record in binary.dat, so that adds only an extra 8 bytes (16 with 8-byte markers).

Note that block_text.dat will only be marginally smaller than loop_text.dat: all that has been saved is the single-character end-of-record marker on each line.

Using binary files
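The storage argument above can be checked directly, and the same few lines show that a single-record binary file is read back with a single read. This is a sketch under assumptions: the program name, the file names, the small n and the test values are invented for illustration, and inquire(size=...) is a Fortran 2003 feature that reports the size in file storage units (normally bytes).

```fortran
program size_and_readback
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000        ! assumed small size for the sketch
  real(dp) :: c(n), d(n)
  integer :: i, nbytes_bin, nbytes_txt

  do i = 1, n
     c(i) = -0.1234567890123456d0 * i
  end do

  ! One unformatted sequential record, as for binary.dat in the text
  open(unit=11, file='binary_small.dat', form='unformatted', status='replace')
  write(11) c
  close(11)

  ! The same data as list-directed text, one number per line
  open(unit=9, file='loop_text_small.dat', form='formatted', status='replace')
  do i = 1, n
     write(9,*) c(i)
  end do
  close(9)

  ! Compare the file sizes with the raw size of the data
  inquire(file='binary_small.dat', size=nbytes_bin)
  inquire(file='loop_text_small.dat', size=nbytes_txt)
  print *, 'data bytes:  ', n * storage_size(c) / 8
  print *, 'binary bytes:', nbytes_bin
  print *, 'text bytes:  ', nbytes_txt

  ! Reading back: one read for the one record, bit-for-bit identical
  open(unit=11, file='binary_small.dat', form='unformatted', status='old')
  read(11) d
  close(11)
  print *, 'read back identical:', all(c == d)
end program size_and_readback
```

The binary file should come out at the data size plus the record markers, and the text file several times larger; the exact text size depends on how your compiler lays out list-directed output.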
The obvious inconvenience of binary files is that you can't see what's in them, so it's not always obvious how to read them. What is needed is some form of file content or file description. Three possible ways of doing this are:
- Write some character strings to the file giving the file description before writing the binary data. Those strings will be readable if you list the file. This method is messy.
- Write a brief auxiliary file describing the format of the binary file and always keep the files together as a pair.
- Use one of the available self-describing file format packages (netCDF is one such package). All file IO goes through a library and the package provides routines to inquire about the file content and format.

If you have to use ascii ...
Of course, unformatted binary output is not always applicable. If you really do want to read the output, it should be text, but be reasonable about what you will and will not read. Either:
- if you really want to preserve the output file, keep it to a minimal summary by writing sparingly and using formats like f7.4 to minimize the number of characters, or
- remember to delete detailed logfiles of a run after you have deduced or extracted the necessary results.
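To give a feel for how much a format like f7.4 saves, the sketch below writes a few values both in free format and with f7.4 and reports the character counts. The program name, variable names and sample values are assumptions chosen for illustration, and the free-format width will vary between compilers.

```fortran
program compact_text
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: x(5)
  character(len=64) :: free_line
  character(len=7)  :: short_field
  integer :: i

  x = [ 0.5d0, 1.25d0, -3.14159265358979d0, 12.0d0, 0.000123456789d0 ]

  do i = 1, size(x)
     write(free_line, *) x(i)             ! list-directed: full precision, many characters
     write(short_field, '(f7.4)') x(i)    ! fixed width of 7 with 4 decimal places
     print '(a,i0,a,a)', 'free format uses ', len_trim(adjustl(free_line)), &
           ' characters; f7.4 gives: ', short_field
  end do
end program compact_text
```

Note that f7.4 only has room for positive values below 100 and negative values above -10; anything larger in magnitude is printed as a row of asterisks, so pick the field width to suit the range of your data.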
If the output from one executable is to be input for another executable on the same system, then an unformatted intermediate file is usually best. Even if the data will be input on another system, it is likely that both use IEEE floating point and are compatible, though byte order and record-marker size can differ between systems and are worth checking. If the second application requires ascii input, consider what a sensible input format is. For example, variations of less than 1 in 10000 are not noticed in graphics, so f7.4 is probably sufficient for a graphics package. Most packages that handle large datasets will allow binary input.
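An unformatted intermediate file between two executables might look like the sketch below, which also uses the simplest of the file-description ideas mentioned earlier: a leading character record, followed by the array length, followed by the data. Everything named here (the file name intermediate.dat, the routines producer and consumer, the 80-character description) is hypothetical, and the two routines stand in for two separate programs only to keep the example in one file.

```fortran
program handoff
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  character(len=*), parameter :: fname = 'intermediate.dat'   ! hypothetical file name

  call producer()
  call consumer()

contains

  ! Stands in for the first executable: two header records, then the data.
  subroutine producer()
    integer, parameter :: n = 1000
    character(len=80) :: description
    real(dp) :: c(n)
    integer :: i
    c = [(sqrt(real(i, dp)), i = 1, n)]
    description = 'record 1: this text; record 2: n; record 3: real*8 c(n)'
    open(unit=11, file=fname, form='unformatted', status='replace')
    write(11) description
    write(11) n
    write(11) c
    close(11)
  end subroutine producer

  ! Stands in for the second executable: reads the header to size the array.
  subroutine consumer()
    character(len=80) :: description
    real(dp), allocatable :: c(:)
    integer :: n
    open(unit=11, file=fname, form='unformatted', status='old')
    read(11) description
    read(11) n
    allocate(c(n))
    read(11) c
    close(11)
    print '(a)', trim(description)
    print '(a,i0,a,f8.4)', 'n = ', n, ', c(n) = ', c(n)
  end subroutine consumer

end program handoff
```

Because the length is stored in its own record, the reader does not need to know n in advance, and the description string shows up as readable text if you list the file.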