How will audio-only levels be migrated?

johnastsang

Some courses feature audio-only levels, such as https://app.memrise.com/community/course/206965/spanish-complete-course-full-audio/6/. Since MLWL does not support audio yet, the audio files appear as "[objectObject]" (see https://mylittlewordland.com/course/3711).

Audio files that accompany words should work fine once audio is implemented. But since MLWL supports two-way testing (in this case, between Spanish and English), audio-only levels will break: going from Spanish audio to English is okay, but the other direction will be a bug. How should this be solved?

ADOMEUPORG

Thank you for making this post. I am teaching Thai and I can't imagine how my students could ever speak and understand the language without hearing the pronunciation of the words they are learning.

Having audio is crucial, but I guess it requires a lot of storage space, which might be expensive for the site's owner.

johnastsang

ADOMEUPORG Hi! It's nice seeing you here, coz I've used your Thai Duolingo course and I love it!

https://app.memrise.com/community/course/2179926/thai-ln/

neoncube

I've just released support for pictures and am getting ready to also implement audio 🙂

johnastsang

neoncube How exciting! Hats off!

neoncube

johnastsang Thank you! ^_^

ADOMEUPORG

Awesome! I can't wait for it.

@neoncube I don't know anything about site hosting, but will the cost rise because of the storage needed for images and audio?

Do you think this will be sustainable long term?

neoncube

ADOMEUPORG It'll cost more, but I'm storing the audio and images in Amazon's cloud, and honestly, the storage costs for that are pretty cheap, relatively speaking 🙂

So far, there are 134GB of images and 272GB of audio (1 million images and ~10 million audio files for about 200,000 courses). It turns out that a lot of audio files are duplicates (probably downloaded from Forvo?), so there's perhaps 200GB of unique ones.

Amazon's cloud storage costs $0.023 per GB per month, so for 400GB, that's about 9USD per month. Bandwidth might be another 10USD, and the database, backend code, etc. are about 15USD, I think, for a total of about 34USD per month, which I think is very reasonable for a site like this 🙂
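
As a quick sanity check of the arithmetic, here is a small Python sketch that uses only the figures quoted in this post. The variable names are made up for illustration, and the bandwidth and backend numbers are rough guesses rather than measured values. Adding up the three quoted components gives roughly 34 USD per month.

```python
# Figures quoted in the post above; treat them as rough estimates.
STORAGE_USD_PER_GB_MONTH = 0.023
STORED_GB = 400
BANDWIDTH_USD = 10  # "might be another 10USD"
BACKEND_USD = 15    # database, backend code, etc.

# Monthly storage cost, then the sum of all three components.
storage_usd = STORED_GB * STORAGE_USD_PER_GB_MONTH
total_usd = storage_usd + BANDWIDTH_USD + BACKEND_USD

print(f"storage: {storage_usd:.2f} USD/month")  # storage: 9.20 USD/month
print(f"total:   {total_usd:.2f} USD/month")    # total:   34.20 USD/month
```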

johnastsang

neoncube "It turns out that a lot of audio files are duplicates"

Again, I'm just curious: how did you find the duplicates? By running a script that compares the information in each audio file?

neoncube

johnastsang Yes 🙂

More specifically, I'm using a technique called "hashing", which is where one takes some data and runs it through a mathematical formula that generates a number based on the contents of the data. For example, a file with the word "hello" might be hashed into the number 10283818712743091273472, and a file with "hi" might be hashed into 81834782170878142087142. We can see that 10283818712743091273472 and 81834782170878142087142 are different numbers, so we know that what's in those files is different.

If we have two files that both contain just the word "hello", both will hash into the same number, 10283818712743091273472, which lets us know that the files are equal. This works well for audio files, too, even though the contents of the file aren't text.
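
To make that concrete, here's a minimal Python sketch. It assumes SHA-256 as the hash function, which is just a common choice and not necessarily the one the site uses. The numbers in the examples above are made up, and a real SHA-256 hash is usually written as 64 hexadecimal characters. The function works on raw bytes, so audio data is handled exactly like text.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

first = content_hash(b"hello")
second = content_hash(b"hello")
other = content_hash(b"hi")

print(first == second)  # True: same contents, same hash
print(first == other)   # False: different contents, different hash
```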

One advantage of this technique is that we don't need to compare each file with every other file, which would take forever. Instead, we generate a list of each file's hash and then check to see if there are any duplicate hashes. If there are, then those two files are almost certainly the same 🙂
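
Roughly, the duplicate check can be sketched like this in Python. The file names and contents are invented placeholders, `find_duplicates` is a hypothetical helper, and SHA-256 is an assumption. Each file is hashed once and grouped by its hash, so no file is ever compared directly against another.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names by content hash; return groups with more than one name."""
    by_hash: defaultdict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        by_hash[hashlib.sha256(data).hexdigest()].append(name)
    return [names for names in by_hash.values() if len(names) > 1]

# Placeholder data standing in for real audio files.
sample = {
    "course1/hola.mp3": b"fake-audio-bytes-A",
    "course2/hola.mp3": b"fake-audio-bytes-A",
    "course2/adios.mp3": b"fake-audio-bytes-B",
}
duplicates = find_duplicates(sample)
print(duplicates)  # [['course1/hola.mp3', 'course2/hola.mp3']]
```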

Another advantage of this technique is that it can reduce the amount of data that needs to be downloaded each time someone practices a course, using caching. To do this, we rename each audio file with its hash. For example, a file might end up being named 12798323868881212331793.mp3 . Then, we tell everyone's browser "After retrieving this file from My Little Word Land, store a copy of it, and the next time we need a file named 12798323868881212331793.mp3, just use your stored (cached) copy, instead of checking if you need to download a new copy."

If we didn't use hashing, then every time someone opened a course, their computer would need to check if it needed to download updated versions of each audio file, since the course creator might have changed one of the pieces of audio but used the same filename as an old piece of audio.
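
Here's a rough Python sketch of the caching idea. The helper names are hypothetical, SHA-256 is an assumption, and the `Cache-Control` value shown is one standard way to tell browsers to keep a copy without rechecking it, not necessarily the exact header the site sends. Because the filename comes from the contents, a changed audio file automatically gets a new name, so an old cached copy can never be served by mistake.

```python
import hashlib

def hashed_filename(data: bytes, extension: str = ".mp3") -> str:
    """Name a file after the SHA-256 hash of its contents."""
    return hashlib.sha256(data).hexdigest() + extension

def cache_headers() -> dict[str, str]:
    """Headers telling browsers to reuse the stored copy without revalidating."""
    one_year = 365 * 24 * 60 * 60  # seconds
    return {"Cache-Control": f"public, max-age={one_year}, immutable"}

# Placeholder bytes standing in for two versions of the same recording.
old_name = hashed_filename(b"fake-audio-v1")
new_name = hashed_filename(b"fake-audio-v2")

print(old_name != new_name)  # True: edited audio gets a different filename
print(cache_headers())       # {'Cache-Control': 'public, max-age=31536000, immutable'}
```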

Fun stuff! 🙂

johnastsang

neoncube Interesting! TIL! 👍

neoncube

johnastsang Ya, it's pretty cool stuff! 🙂

This is the first time that I've worked on a website with this much data, so I've gotten to learn some things, too! 🙂