Today I came across a question. Let me share it first.
This problem was asked by Google.
Implement a file syncing algorithm for two computers over a low-bandwidth network. What if we know the files in the two computers are mostly the same?
This question actually took me back to an idea Richard had in Silicon Valley. Let me share you the clip.
Disclaimer: You may want to put your headphones on!
So, the idea of this isn't new and is already implemented.
Let me explain. But to understand this, you need to know a few things before.
Hash of each file is actually a unique one. Consider an example below
In the above example, contents of file1 and file2 differ by just one character, or by 1 ASCII. But, the hash produced by these 2 files are totally different. A good hashing algorithm ensures no 2 inputs can have the same hash. Or, no 2 files can have the same hash.
For this example, we shall be using SHA256 hashing algorithm.
Address Based Addressing
Let's say you upload your videos to YouTube. So your videos appear in the address of youtube.com/myvideos . So if you ever want to transfer your videos from YouTube to Vimeo for example, you need to transfer all your files and content from servers of YouTube to servers of Vimeo. So now, the address of your videos are in address vimeo.com/myvideos. So, more space is needed, and your videos are tied to a server and would become a hassle.
Content Based Addressing
We have already discussed hashing. And each file has a unique hash. So we can say that the hash of a file can be used as an address as each hash is unique and each address has to be unique. So even if we need to transfer our videos from YouTube to Vimeo, the address of the files can remain the same. This is what content based addressing provides us with, and this provides many benefits such as saving space, and helping us move to serverless and decentralized internet.
Peer to Peer connection
Here, different computers, PCs and other mobile devices are connected to each other in a network. So suppose if you need a file such as Example.txt, you can ask if anyone in the network has that file. If anyone has that file, they may send it over the network to the one who needs it. But for this to happen, everyone in the network needs to be trusted as anyone may pass a malicious file and entrap the user asking for the file. Hence all users in the network needs to be trusted. But if we build a decentralized internet, we need to build a peer to peer network and not everyone in the internet can be trusted.
Finally, Merkle Tress which actually solves the problems stated above. This data structure is used by many existing technologies such as Bitcoin.
Let me explain this to you in a very simple way, and this can be used to solve the question asked by Google.
We have a figure here, and let me explain you how this data structure is used.
Merkle Tree is basically a binary tree. The leaf nodes contains the files, and the parent contains the sum of hashes of its children. So A is the combination of hash of file1 and file2 and C is the sum of hash of A and B and so on. Here, C is the root node and according to Merkle Trees we call it root hash.
So, when Friend1 needs to send a file to Friend2, they only need to exchange C which is basically 32 bytes long. Now if Friend2 asks for files over the peer to peer network, and then build the merkle tree based on the files he received. If the built hash matches to the hash sent by Friend1 which is C, then the files are correct and can be trusted.
If the hash doesn't match, then it can move to its children to compare their hashed until we find the file which causes the root hash to not match.
Here, C doesn't match with F and further down B doesn't match with D. Moving on, we see that File4 doesn't match File5 and hence File4 either may have been updated of File4 may have been deleted and FIle5 is created. Eitherway, File4 and File5 needs to be dealt with.
According to question, to implement a file syncing algorithm we can implement this and whenever the root hash doesn't match, we can further investigate to find the file which has been updated and only that file will be exchanged, updated and synced.
In the above example, both the computers needs to have File5 and not File4.
[This topic is exceptionally advanced and I may explain more or even edit this post to explain this concept further. Leave a comment down below with your questions and I will surely get back to you ASAP]
For exceptional nerds: Here is a repository to the code of Merkle Tree