You can use tee to do the sum on the fly with something like this (adapt the netcat commands for your needs):
netcat -l -w 2 1111 | tee >( md5sum > /dev/stderr )
tee >( md5sum > /dev/stderr ) | netcat 127.0.0.1 1111
Nerdwaller’s answer about using tee to simultaneously transfer and calculate a checksum is a good approach if you’re primarily worried about corruption over the network. It won’t protect you against corruption on the way to disk, etc., though, as it’s taking the checksum before the data hits disk.
But I’d like to add something:
1 TiB / 40 minutes ≈ 437 MiB/sec¹.
That’s pretty fast, actually. Remember that unless you have a lot of RAM, that’s got to come back from storage. So the first thing to do is watch iostat -kx 10 as you run your checksums; in particular, pay attention to the %util column. If you’re pegging the disks (near 100%), then the answer is to buy faster storage.
Otherwise, as other posters mentioned, you can try different checksum algorithms. MD4, MD5, and SHA-1 are all designed to be cryptographic hashes (though none of those should be used for that purpose anymore; all are considered too weak). Speed-wise, you can compare them with openssl speed md4 md5 sha1 sha256. I’ve thrown in SHA256 to have at least one hash that is still strong enough.
The 'numbers' are in 1000s of bytes per second processed.
type        16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md4        61716.74k   195224.79k   455472.73k   695089.49k   820035.58k
md5        46317.99k   140508.39k   320853.42k   473215.66k   539563.35k
sha1       43397.21k   126598.91k   283775.15k   392279.04k   473153.54k
sha256     33677.99k    75638.81k   128904.87k   155874.91k   167774.89k
Of the above, you can see that MD4 is the fastest, and SHA256 the slowest. This result is typical on PC-like hardware, at least.
If you want even more performance (at the cost of being trivial to tamper with, and also less likely to detect corruption), you want to look at a CRC or Adler hash. Of the two, Adler is typically faster, but weaker. Unfortunately, I’m not aware of any really fast command line implementations; the programs on my system are all slower than OpenSSL’s md4.
So, your best bet speed-wise is openssl md4 -r (the -r makes it look like md5sum output).
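Since the ranking can differ from machine to machine, it may be worth timing the tools you actually have against the same data. The following is only a rough sketch: bench is a hypothetical helper name, the 64 MiB all-zero sample is an arbitrary choice, and date +%s%N assumes GNU date (it will not work with BSD/macOS date). Tools that are not installed are skipped.

```shell
#!/usr/bin/env bash
# Rough timing of checksum tools on one sample file.
# Assumptions: GNU date (for %N nanoseconds); "bench" is a hypothetical helper.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
head -c 67108864 /dev/zero > "$sample"    # 64 MiB of zeros as test input

bench() {
    for tool in "$@"; do
        # Skip tools that are not installed on this system.
        command -v "$tool" >/dev/null 2>&1 || continue
        start=$(date +%s%N)
        "$tool" "$sample" >/dev/null
        end=$(date +%s%N)
        echo "$tool $(( (end - start) / 1000000 )) ms"
    done
}

bench md5sum sha1sum sha256sum cksum
```

A single run on a small file is noisy; for a 1 TiB job you would want a larger sample and several runs, and to make sure the file comes from the same storage as the real data.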
If you’re willing to do some compiling and/or minimal programming, see Mark Adler’s code over on Stack Overflow and also xxhash. If you have SSE 4.2, you will not be able to beat the speed of the hardware CRC instruction.
¹ 1 TiB = 1024⁴ bytes; 1 MiB = 1024² bytes. That comes to ≈417 MB/sec with powers-of-1000 units.
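For anyone who wants to redo the arithmetic with their own size and duration, here is a small sketch using awk. The variable names are only illustrative; the 40 minutes and 1 TiB (or 1 TB) figures are the ones from this answer.

```shell
#!/usr/bin/env bash
# Throughput needed to process 1 TiB (binary units) or 1 TB (decimal units)
# in 40 minutes. Variable names are illustrative only.
mib_per_sec=$(awk 'BEGIN { printf "%.0f", 1024^4 / 1024^2 / (40 * 60) }')
mb_per_sec=$(awk 'BEGIN { printf "%.0f", 1000^4 / 1000^2 / (40 * 60) }')
echo "$mib_per_sec MiB/s, $mb_per_sec MB/s"   # prints "437 MiB/s, 417 MB/s"
```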
The openssl command supports several message digests. Of the ones I was able to try, md4 seems to run in about 65% of the time of md5, and about 54% of the time of sha1 (for the one file I tested with).
There’s also an md2 in the documentation, but it seems to give the same results as
Very roughly, speed seems to be inversely related to quality, but since you’re (probably) not concerned about an adversary creating a deliberate collision, that shouldn’t be much of an issue.
You might look around for older and simpler message digests (was there an md1, for example?).
A minor point: You’ve got a Useless Use of cat. Rather than:
cat foo.box | nc <archive IP> 1234
you can use:
nc <archive IP> 1234 < foo.box
or:
< foo.box nc <archive IP> 1234
Doing so saves a process, but probably won’t have any significant effect on performance.
In some circumstances sha1sum is faster.
It will take longer to transfer, but rsync verifies that the file arrived intact.
From the rsync man page:
Note that rsync always verifies that each transferred file was
correctly reconstructed on the receiving side by checking a whole-file
checksum that is generated as the file is transferred…
Science is progressing. It appears that the new BLAKE2 hash function is faster than MD5 (and cryptographically much stronger to boot).
From Zooko’s slides:
Cycles per byte on Intel Core i5-3210M (Ivy Bridge):
function   long msg   4096 B   64 B
MD5           5.0       5.2    13.1
SHA1          4.7       4.8    13.7
SHA256       12.8      13.0    30.0
Keccak        8.2       8.5    26.0
BLAKE1        5.8       6.0    14.9
BLAKE2        3.5       3.5     9.3
You probably can’t do any better than a good hash.
You might want to check out other hash/checksum functions
to see whether any are significantly faster than
Note that you might not need something as strong as MD5.
MD5 (and things like SHA1) are designed to be cryptographically strong,
so it is infeasible for an attacker/imposter to craft a new file
that has the same hash value as an existing file
(i.e., to make it hard to tamper with signed e-mails and other documents).
If you’re not concerned about an attack on your communications,
but only a run-of-the-mill comms error,
something like a cyclic redundancy check (CRC) might be good enough.
(But I don’t know whether it would be any faster.)
Another approach is to try to do the hash in parallel with the transfer.
This might reduce the overall time,
and could definitely reduce the irritation factor
of needing to wait for the transfer to finish,
and then wait again for the MD5 to finish.
I haven’t tested this, but it should be possible to do something like this:
On the source machine:
mkfifo myfifo
tee myfifo < source_file | nc dest_host port_number &
md5sum myfifo
On the destination machine:
mkfifo myfifo
nc -l -p port_number | tee myfifo > dest_file &
md5sum myfifo
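To try the FIFO idea without a second machine, here is a local sketch in which a plain cat stands in for the nc transfer. It is an illustration under assumptions, not a tested network setup: the temporary directory, the 1 MiB random test file, and the variable names are all made up for the demo, and md5sum is started in the background first (rather than last, as above) so that wait can collect it.

```shell
#!/usr/bin/env bash
# Hash a stream while it is being copied, using a FIFO and tee.
# "cat" stands in for the nc transfer; all paths and names are hypothetical.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
src="$workdir/source_file"
dst="$workdir/dest_file"
fifo="$workdir/myfifo"

head -c 1048576 /dev/urandom > "$src"    # 1 MiB of random test data
mkfifo "$fifo"

# The reader of the FIFO: hashes whatever tee writes into it.
md5sum < "$fifo" > "$workdir/stream.md5" &

# tee duplicates the stream: one copy to the FIFO, one copy "over the wire".
tee "$fifo" < "$src" | cat > "$dst"
wait    # let the background md5sum finish

stream_sum=$(cut -d' ' -f1 "$workdir/stream.md5")
disk_sum=$(md5sum < "$dst" | cut -d' ' -f1)
echo "stream: $stream_sum"
echo "disk:   $disk_sum"
```

Comparing the on-the-fly sum with a sum of the file as written also covers the case where the data is damaged on its way to disk, at the cost of reading the file once more.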
Of course checking the sizes of the files is a good, quick way to detect if any bytes got dropped.
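A minimal sketch of such a size check, using wc -c so it does not depend on a particular stat syntax. The function name same_size and the small test files are hypothetical; note that equal sizes only rule out dropped or added bytes, not altered ones.

```shell
#!/usr/bin/env bash
# Compare the byte counts of two files. "same_size" is a hypothetical helper.
same_size() {
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ]
}

# Small demo files, made up for this example.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'abcdef' > "$tmp/a"
printf 'uvwxyz' > "$tmp/b"
printf 'abc'    > "$tmp/c"

same_size "$tmp/a" "$tmp/b" && echo "a and b: sizes match"
same_size "$tmp/a" "$tmp/c" || echo "a and c: sizes differ"
```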
Sending huge files is a pain. Why not split the file into chunks, generate a hash for each chunk, send the chunks and hashes to the destination, and then check the hashes there and join the chunks back together?
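As a rough illustration of the chunking idea, the sketch below does everything in one temporary directory, so the transfer step itself is left out. The file names (big_file, chunk., chunks.md5, rejoined) and the chunk size are arbitrary choices for the demo; split, md5sum -c, and cat are used as in GNU coreutils.

```shell
#!/usr/bin/env bash
# Split a file into chunks, hash each chunk, verify, and rejoin.
# All file names and sizes are hypothetical; the transfer step is omitted.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
head -c 3000000 /dev/urandom > "$workdir/big_file"

# Source side: create chunk.aa, chunk.ab, ... and a list of their hashes.
(
    cd "$workdir" || exit 1
    split -b 1000000 big_file chunk.
    md5sum chunk.* > chunks.md5
)

# ... copy chunk.* and chunks.md5 to the destination here ...

# Destination side: rejoin only if every chunk verifies.
(
    cd "$workdir" || exit 1
    md5sum -c chunks.md5 > /dev/null && cat chunk.* > rejoined
)

cmp -s "$workdir/big_file" "$workdir/rejoined" && echo "rejoined file matches"
```

A benefit of this layout is that a failed check points at a specific chunk, so only that chunk needs to be sent again.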
You could also set up a personal BitTorrent network. That would ensure that the whole thing reaches safely.