johnastsang Yes 🙂
More specifically, I'm using a technique called "hashing", which is where one takes some data and runs it through a mathematical formula that generates a number based on the contents of the data. For example, a file with the word "hello" might be hashed into the number 10283818712743091273472, and a file with "hi" might be hashed into 81834782170878142087142. We can see that 10283818712743091273472 and 81834782170878142087142 are different numbers, so we know that what's in those files is different.
If we have two files that both contain just the word "hello", both will hash into the same number, 10283818712743091273472, which lets us know that the files are equal. This works well for audio files, too, even though the contents of those files aren't text.
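If you're curious what that looks like in code, here's a minimal sketch in Python. It assumes SHA-256 as the hash function purely for illustration (I haven't said which formula is actually used here), and `content_hash` is a made-up helper name. The numbers it produces won't match the example numbers above, which were just invented to show the idea.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

hello_a = content_hash(b"hello")   # first file containing "hello"
hello_b = content_hash(b"hello")   # second file with the same contents
hi = content_hash(b"hi")           # a file with different contents

print(hello_a == hello_b)  # True: same contents, same hash
print(hello_a == hi)       # False: different contents, different hash

# The hex string is just a very large number written in base 16.
print(int(hello_a, 16))
```

Because the function takes raw bytes, the same code works on audio data as well as text.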
One advantage of this technique is that we don't need to compare each file with every other file, which would take forever. Instead, we generate a list of each file's hash and then check to see if there are any duplicate hashes. If there are, then those files are almost certainly the same 🙂
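The duplicate check can be sketched roughly like this. Again, this is only an illustration under assumptions: SHA-256 stands in for whatever hash is really used, `find_duplicates` is a hypothetical helper, and the file names and contents are invented. Each file is hashed once and grouped by its hash, so no file ever gets compared directly against another.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names by the hash of their contents; return groups of 2 or more."""
    groups = defaultdict(list)
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append(name)
    # Any hash shared by more than one file marks a set of duplicates.
    return [names for names in groups.values() if len(names) > 1]

# Invented sample data standing in for real audio files.
files = {
    "a.mp3": b"\x00\x01audio-one",
    "b.mp3": b"\x00\x02audio-two",
    "c.mp3": b"\x00\x01audio-one",  # same bytes as a.mp3
}

print(find_duplicates(files))  # [['a.mp3', 'c.mp3']]
```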
Another advantage of this technique is that it can reduce the amount of data that needs to be downloaded each time someone practices a course, using caching. To do this, we rename each audio file with its hash. For example, a file might end up being named 12798323868881212331793.mp3. Then, we tell everyone's browser: "After retrieving this file from My Little Word Land, store a copy of it, and the next time we need a file named 12798323868881212331793.mp3, just use your stored (cached) copy, instead of checking if you need to download a new copy."
If we didn't use hashing, then every time someone opened a course, their computer would need to check if it needed to download updated versions of each audio file, since the course creator might have changed one of the pieces of audio but used the same filename as an old piece of audio.
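Here's a small sketch of the renaming idea, with the usual caveats: SHA-256 is an assumed hash function, `hashed_filename` and `cache_headers` are hypothetical helpers, and the `Cache-Control` value shown is just one common way to tell a browser to keep a file without rechecking it, not necessarily what the site sends.

```python
import hashlib
from pathlib import PurePosixPath

def hashed_filename(original_name: str, data: bytes) -> str:
    """Name a file after the hash of its contents, keeping the extension."""
    suffix = PurePosixPath(original_name).suffix
    return hashlib.sha256(data).hexdigest() + suffix

def cache_headers() -> dict[str, str]:
    """Headers asking the browser to reuse its stored copy for up to a year."""
    return {"Cache-Control": "public, max-age=31536000, immutable"}

# The creator re-records the audio but keeps the original filename.
old = hashed_filename("hello.mp3", b"old recording")
new = hashed_filename("hello.mp3", b"new recording")

print(old != new)  # True: changed audio gets a new name
print(cache_headers())
```

Since changed audio always ends up with a different name, a cached copy under the old name can never be stale, which is what makes it safe to skip the "do I need a new copy?" check.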
Fun stuff! 🙂