Hashing Files
File hashes are like digital fingerprints. If two file hashes match, it means the contents of the file are the same. Hashing is used to verify the integrity of data.
Learning Objectives
You should be able to:
- Describe the purpose of hashing files
- List popular file hashing algorithms: MD5, SHA2
- Compute the hash of files using PowerShell on Windows, and recognize the equivalent Linux command
Video Walkthrough
Use this video to follow along with the steps in this lab.
About Hashing
Hashing falls under the cryptography umbrella, but it is not encryption, although the two share some mathematical concepts. Hashes are one-way functions: data is sent to the hash function, and a hash digest is output. For a well-designed hash function, reversing the process is computationally infeasible. You cannot take the hash digest and reconstruct the input. Encryption algorithms, by contrast, are reversible as long as you know the key.
Hash functions take data as input and output a hash, also known as a digest. A good hashing algorithm should output hashes that do not follow any recognizable patterns. Changing a single bit in the file should result in a completely different hash digest. MD5 (Message Digest algorithm 5) was a popular hashing algorithm, but it turned out to be mathematically flawed. Researchers found practical ways to produce collisions, meaning two different inputs with the same digest, and in that sense "broke" the algorithm. MD5 is considered broken and should not be used where security matters. The Secure Hash Algorithm version 1 (SHA1) was widely adopted as a replacement for MD5, but researchers found that it was also flawed.
The Secure Hash Algorithm version 2 (SHA2) replaced SHA1. Today, SHA2 is a popular hashing algorithm that is still considered strong. It is widely adopted. SHA2 has longer hash digests than MD5, but SHA2's strength is in the mathematical improvements that make it more resistant to attack. SHA3 is a newer hashing algorithm that is also considered strong, but is less widely used.
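The properties described above can be sketched in a few lines of Python. Python is used here only as a cross-platform illustration (the lab itself uses PowerShell), and the two sample inputs are hypothetical. The snippet hashes two inputs that differ in exactly one bit, counts how many digest bits differ, and then prints the digest length of each algorithm mentioned in this section.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA2-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions where two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

# Two illustrative inputs that differ in exactly one bit:
# "a" is 0x61 (01100001) and "c" is 0x63 (01100011).
first = b"a"
second = b"c"

digest_first = sha256_hex(first)
digest_second = sha256_hex(second)
print(digest_first)
print(digest_second)
print("digest bits that differ:", differing_bits(digest_first, digest_second), "of 256")

# Digest length, in bits, of each algorithm named in this section.
for name in ("md5", "sha1", "sha256", "sha3_256"):
    print(name, hashlib.new(name).digest_size * 8)
```

For a good hash function, roughly half of the digest bits are expected to change when a single input bit changes, which is why the two digests look unrelated.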
Magical Blender
No matter how big the input, hashing functions always produce the same size output. It does not matter if the file is a single bit or a terabyte.

Imagine that a hash function is a magical blender. Whether you add a single grape or a truckload of watermelons, the magical blender will always output 16 ounces of smoothie. The blender also changes the smoothie's flavor according to a repeatable mathematical formula. Each smoothie will be unique as long as the ingredients differ, even if they differ only slightly. Two grapes might produce a smoothie that tastes like bacon. If you make another smoothie with two grapes that match 100% (down to the molecular level), the blender will make another bacon-flavored smoothie. Three grapes might produce a smoothie that tastes like toasted watermelon. Consider the following smoothies made from this magical blender.

If we know the magical blender made those smoothies, we have no idea what the original input was. The input could have been 20 pounds of pineapple, a head of lettuce, a grain of rice, or any other food. The blender always makes 16 ounces of smoothie with a unique flavor.
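The blender analogy can be checked with a short Python sketch (Python is an illustrative choice here, not part of the lab, and the "ingredients" are made-up byte strings). It shows that a tiny input and a large input both produce a digest of the same length, and that identical inputs always produce identical digests.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA2-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

one_grape = b"g"                       # a single byte of input
truckload = b"watermelon" * 100_000    # about 1 MB of input

small_digest = sha256_hex(one_grape)
large_digest = sha256_hex(truckload)

# Both digests are 64 hex characters (64 * 4 = 256 bits).
print(len(small_digest), len(large_digest))

# Identical ingredients always give the same "flavor".
print(sha256_hex(b"two grapes") == sha256_hex(b"two grapes"))    # True

# Slightly different ingredients give a different "flavor".
print(sha256_hex(b"two grapes") == sha256_hex(b"three grapes"))  # False
```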
Hash Files in Windows
Next, you will hash several files in Windows and analyze the output.
- Launch your Windows Server virtual machine. (You can do this on any Windows computer.)
- Download the lab files (a.txt, b.txt, long.txt, essay.txt, essay_twin.txt, smile.png, and smile_twin.png) by right-clicking on them and choosing the save option. If you simply click on the links, the files will be displayed in your web browser instead of being downloaded.
- Open the folder where you downloaded the files.
- Inspect each of the text files. Just double-click them to open them in Notepad.
- Note that the contents of essay.txt and essay_twin.txt look very similar, but are they exactly the same? You'll use hashing to find out.
- Look at smile.png and smile_twin.png. They look the same. But are the files exactly the same? You'll use hashing to find out.

smile.png

smile_twin.png
Calculate Hashes with PowerShell
- With Windows Explorer open to the folder with your downloaded files, choose File > Open Windows PowerShell.
- The PowerShell prompt's working directory should be the folder with the files.
- Run the following command to compute the SHA2-256 hash of a.txt. (On Linux, the command sha256sum a.txt would calculate the hash of the file.)
get-filehash a.txt
You should see the following result.
Algorithm Hash
--------- ----
SHA256 CA978112CA1BBDCAFAC231B39A23DC4DA786EFF8147C4E72B9807785AFEE48BB
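For comparison, here is a minimal Python sketch of the same computation. It is not how Get-FileHash is implemented; the function name file_sha256 and the chunk size are illustrative choices. The digest shown above is the SHA2-256 hash of the single byte "a", so the sketch should print the same value if a.txt contains only that character with no trailing newline.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=65536):
    """Return the uppercase SHA2-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()

# Assumes a.txt from this lab is in the current working directory.
target = Path("a.txt")
if target.exists():
    print("SHA256", file_sha256(target))
```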
- SHA2 will compute the same output every time it receives the same input.
- Calculate the SHA2-256 hash of b.txt.
get-filehash b.txt
The result should be:
Algorithm Hash
--------- ----
SHA256 3E23E8160039594A33894F6564E1B1348BBD7A0088D42C4ACB73EEAED59C009D
- Notice that the input for both a.txt and b.txt was a single character, but the hash is 64 hexadecimal characters long (64 * 4 bits per character = 256 bits).
- Compute the hash for long.txt.
get-filehash long.txt
The result should be:
Algorithm Hash
--------- ----
SHA256 8AC47CB4F659E638ADB4DA65E00B5214C434016EE32367E8535C9B2D22FA7386
- Even though long.txt has a lot more content than a.txt and b.txt, the hash digest is still 256 bits long. You could hash a 1 gigabyte movie and the SHA2-256 hash would still be 256 bits long. This should give you some evidence that, given only a hash digest, you could not tell whether the input was small or large, much less what the original input contained.
- Check the hashes of essay.txt and essay_twin.txt.
get-filehash essay.txt
get-filehash essay_twin.txt
- Are the file contents the same? The hashes should look very different, which indicates that the file contents are not the same.
- You can try to find the difference manually by carefully reading each file to look for changes. Linux and Windows both have tools for comparing text files. Run the following command to check for differences in the text files line by line. (In Windows PowerShell, diff and cat are aliases for Compare-Object and Get-Content.)
diff (cat essay.txt) (cat essay_twin.txt)
- The diff result shows that one of the files says "produce" and the other says "product." So in this case, hashing told us that the file contents were not the same. But file hashing cannot tell us what has changed. For text files, the diff tool can be helpful.
InputObject SideIndicator
----------- -------------
to the overall produce as =>
to the overall product as <=
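A similar line-by-line comparison can be sketched in Python with the standard difflib module. This is an illustration, not a reimplementation of PowerShell's diff; the helper name changed_lines is made up, and the lines surrounding the changed one in the sample text are hypothetical because the essay contents are not reproduced here.

```python
import difflib

def changed_lines(reference_text: str, other_text: str):
    """Return (side, line) pairs for lines that appear in only one text.

    "<=" marks a line found only in the reference text and
    "=>" marks a line found only in the other text.
    """
    changes = []
    diff = difflib.ndiff(reference_text.splitlines(), other_text.splitlines())
    for line in diff:
        if line.startswith("- "):
            changes.append(("<=", line[2:]))
        elif line.startswith("+ "):
            changes.append(("=>", line[2:]))
    return changes

# Hypothetical three-line excerpts; only the middle line comes from the lab output.
reference = "first line\nto the overall product as\nlast line"
other = "first line\nto the overall produce as\nlast line"

for side, line in changed_lines(reference, other):
    print(line, side)
```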
- The two smiley faces look the same, but unless we hash the files we cannot know for sure. Calculate the SHA2-256 hashes.
get-filehash smile.png
get-filehash smile_twin.png
The hashes should match, indicating that every single bit in each of the files matches.
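Comparing two digests by eye is error-prone, so the comparison itself can be automated. The following Python sketch (function names are illustrative, not from the lab) hashes two files and reports whether their digests are equal.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=65536):
    """Return the SHA2-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_contents(path_a, path_b) -> bool:
    """Return True when both files have the same SHA2-256 digest."""
    return file_sha256(path_a) == file_sha256(path_b)

# Assumes the two image files from this lab are in the current folder.
if Path("smile.png").exists() and Path("smile_twin.png").exists():
    print("Files match:", same_contents("smile.png", "smile_twin.png"))
```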
File Hashing Use Cases
File hashing is used in the following cases:
- When syncing files between computers, programs can check hashes to see if files have been modified since the last time files were synced.
- When law enforcement seizes phones and computer equipment, hashes of all files are taken before forensic analysis begins. This lets anyone later determine whether files were modified during the investigation. Without hashes and a proper chain of custody, digital evidence may not be admissible in court.
- When uploading files to cloud computing services, you sometimes have to send a hash of the file to ensure that the cloud provider received the file intact.
- Some websites publish the hash of files that you download. You can verify the hash to make sure that nobody inserted malware into the file you downloaded.
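The last use case, checking a download against a published hash, can be sketched in Python as follows. The function names are illustrative. The example reuses a.txt and its digest from earlier in this lab as a stand-in for a downloaded file and the hash a website would publish.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=65536):
    """Return the lowercase SHA2-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, published_hex: str) -> bool:
    """Return True if the file's digest equals the published hex value.

    The comparison ignores letter case and surrounding whitespace,
    because tools differ in how they print hex digests.
    """
    return file_sha256(path) == published_hex.strip().lower()

# Stand-in for a published hash: the a.txt digest shown earlier in this lab.
PUBLISHED = "CA978112CA1BBDCAFAC231B39A23DC4DA786EFF8147C4E72B9807785AFEE48BB"

if Path("a.txt").exists():
    print("Download verified:", matches_published_hash("a.txt", PUBLISHED))
```

A check like this only helps if the published hash comes from a source you trust, since someone who can replace the file may also be able to replace a hash posted next to it.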
Challenge
- Edit smile.png. Change one of the pixels to a slightly different color. After saving the file, recalculate the hash.
Reflection
- How might a website like Pinterest use file hashing to detect duplicate images?
- How might a website like Facebook use hashing to determine if content violates its guidelines or the law?
Key Terms
- File Hash: A fixed-size string or number generated from the contents of a file using a hash function. For a strong hash function it is, for practical purposes, unique to those contents. It serves as a digital fingerprint of the file, allowing for the verification of file integrity and detection of any changes or corruption. Common hash functions include MD5, SHA-1, and SHA-256.
- PowerShell - Get-FileHash: A PowerShell cmdlet used to compute the hash value of a file. It supports various hash algorithms, such as MD5, SHA-1, and SHA-256. The cmdlet is useful for verifying file integrity and ensuring that files have not been tampered with.
- Linux - sha256sum: A command-line utility in Linux used to compute and verify SHA-256 hash values of files. It reads the file content and generates a SHA-256 hash, which can be used to check the file's integrity.