Just got back from my Christmas/New Year/birthday vacation! 🙂

I've begun running a script that's downloading unlisted courses. This should download courses that:

  • Were created by Memrise and later unlisted
  • Were beyond the 666 per-category page limit
  • Were recently switched from private to public

I expect this script to take about 1 week to fully run, since it has to check a lot of courses (about 2.5 million). Currently, it looks like about 1/5 of all courses were unlisted! o_o

I ran the script overnight, and it checked about 250,000 courses, so it may take a couple of weeks to download everything 🙂

Interestingly, the number of courses that are being discovered and downloaded rose from 20% to 70%! o_o


NewLandRise I don't think there's an officially published number, but according to this post, it looks like the course numbers go up to 6,500,000 or so: https://forum.mylittlewordland.com/d/57-unlisted-official-arabic-courses/5

Some of the courses have been deleted or are private, but I'd expect at least 3-4 million courses or so to be public.

I continued running the script last night, and it looks like it's able to pretty consistently scan 300k courses per night. For 6,500,000 courses, that'd take a little over 20 nights, and that's just for downloading the basic course information (name, category, author, description, etc.). After that, I'd need to download each level's vocabulary and audio/images, so we'd probably be looking at about 40 evenings of script running, plus processing time of perhaps a week. That's pretty long, so I probably need to start also running the script during the day or increasing the number of courses that are downloaded in parallel. Currently, I'm downloading 5 in parallel 🙂
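The estimate above can be checked with a quick calculation. This is only a sketch: the figures are the ones quoted in the post, and the function name `nightsToScan` is made up for illustration.

```javascript
// Hypothetical helper: whole nights needed to scan `totalCourses` course IDs
// when one night covers `coursesPerNight` of them.
function nightsToScan(totalCourses, coursesPerNight) {
  return Math.ceil(totalCourses / coursesPerNight);
}

// Figures from the post: about 6,500,000 course IDs at roughly 300k per night.
const metadataNights = nightsToScan(6_500_000, 300_000);
console.log("nights for basic course information:", metadataNights);
```

Rounding up to whole nights gives 22, which matches "a little over 20 nights" for the basic course information alone.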

    I have more questions, if you don't mind:

    1.) How many courses were you able to discover (and download) from the public course categories?

    2.) I assume that you are now accessing the courses through the course ID (from 1 to 6537036). Are you able to retrieve the status that a course is set to? By status I mean that when a course is created, it can be set to one of 3 states: Incomplete, Unlisted or Public. Did you find a way to retrieve this information?
    I guess there must be tons of courses that are half finished and still in the Incomplete/Unlisted status. And Memrise hasn't implemented any access control for Incomplete/Unlisted courses, so anyone can access any course. How will you filter out such courses?

      neoncube Eltaurus' script became too heavy to be of practical use as the number of courses in batch mode increased, so I modified it in my own way.

      https://github.com/7shi/CourseDump2022

      In my tests, I could download 500,000 files from 3,000 courses at once in batch mode. (It seems that if I increase the number of courses any further, the V8 engine crashes partway through.)

      First of all, I made a pull request for the part about controlling the number of simultaneous connections, but it has not been merged yet.

      https://github.com/Eltaurus-Lt/CourseDump2022/pull/38

      Note: I'm afraid of getting banned for overloading, so I am refraining from using this script at this time.

        7shi Hm, interesting…

        If I try to download more than about 5 courses at a time (e.g. just downloading the HTML of https://app.memrise.com/community/course/<id>/course-name/), I start to get this response:

        <html>
        <head><title>502 Bad Gateway</title></head>
        <body>
        <center><h1>502 Bad Gateway</h1></center>
        <hr><center>nginx</center>
        </body>
        </html>

        I'd assumed I was overloading the Memrise servers and backed off to 5.

        It looks like the Memrise audio and image files are stored on Amazon S3, though, and I can download about 50 of those files in parallel before getting 502 errors 🙂

        I wonder if I'm being rate limited because of the large number of courses and S3 files that I've downloaded over the past few weeks.

        Either way, I'm happy just downloading 5 courses in parallel for a week or so. Hopefully that'll be lighter on the Memrise servers, too 🙂
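Capping downloads at 5 in parallel can be sketched with a small worker pool. This is an illustration under assumptions, not the script used in the thread: `runWithLimit` and the `worker` callback are hypothetical names, and the usage comment only reuses the URL pattern quoted above.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` tasks in flight.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each lane claims the next unclaimed item until none are left.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results; // results keep the order of `items`
}

// Usage sketch (not executed here): fetch course pages 5 at a time.
// const pages = await runWithLimit(courseIds, 5, (id) =>
//   fetch(`https://app.memrise.com/community/course/${id}/course-name/`).then((r) => r.text()));
```

Because JavaScript runs each lane on a single thread, claiming an index with `next++` needs no locking, and raising or lowering `limit` is the only change needed to experiment with the 502 threshold.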

          neoncube 502 occurs rather frequently. If I wait a little and try again, it almost always succeeds. I have included countermeasure code for this:

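          // Minimal helper assumed by fetchRetry: resolve after `ms` milliseconds.
          const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

          // Fetch `url`, retrying when the server answers 502 or the request throws.
          async function fetchRetry(url, options, retries = 3, interval = 1000) {
            let ret;
            for (let i = 0; i < retries; i++) {
              if (i) console.log("retry", i);
              // Short pause before the first attempt, longer pause before each retry.
              await sleep(i ? interval : 200);
              try {
                ret = await fetch(url, options);
                if (ret.status !== 502) break; // anything other than 502 is returned as-is
              } catch (e) {
                // Network error: rethrow only once the last attempt has failed.
                if (i === retries - 1) throw e;
              }
            }
            return ret; // may still be a 502 response if every attempt returned one
          }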

            NewLandRise I was able to discover around 220,000 courses from the public courses categories 🙂

            I'd totally forgotten that Memrise lets one set a course to "Incomplete" or "Unlisted"! Indeed, I'm just scanning the courses from 1 to 7,000,000'ish, and this might explain why I'm hitting so many courses that weren't listed via the public course categories! 🙂

            I just took a quick look at an unlisted course, and it doesn't look like it has anything to indicate that it's unlisted. I can think of a couple of ways that we might be able to determine whether a course was incomplete/unlisted or not present on the course categories pages because it wasn't popular enough to be in the first 666 pages of a category:

            1. Check to see if the course belongs to a category that has at least 666 pages. If not, then the course should be incomplete/unlisted.
            2. Download each user's profile and compile a list of all courses taught by all users. If a course is on this list, then it must be public. I'm not sure if this would work for courses where the author has already deactivated their account, though.
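The two ideas above could be combined roughly as follows. This is a sketch under assumptions: the function name, the shape of the `course` object, and the three lookup tables are all hypothetical, and the returned labels are guesses rather than a status that Memrise reports.

```javascript
const PAGE_LIMIT = 666; // per-category page limit mentioned in the thread

// Hypothetical classifier.
//   course             - { id, category }
//   categoryPageCounts - Map of category name -> number of listing pages
//   listedIds          - Set of course IDs found on the category pages
//   profileIds         - Set of course IDs found on user profiles
function classifyCourse(course, categoryPageCounts, listedIds, profileIds) {
  if (listedIds.has(course.id)) return "public";
  if (profileIds.has(course.id)) return "public"; // idea 2: taught on a user profile

  const pages = categoryPageCounts.get(course.category);
  if (pages === undefined) return "unknown"; // no page count for this category

  // Idea 1: a category below the limit shows all of its public courses,
  // so a course missing from it is presumably incomplete or unlisted.
  if (pages < PAGE_LIMIT) return "incomplete-or-unlisted";

  // The category hit the limit, so the course may simply be past the last page.
  return "unknown";
}
```

Courses that come back as "unknown" are the ones where neither idea settles the question, for example a course in a full category whose author has deactivated their account.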

              7shi Thanks 🙂

              Let me take another look at my code. When I'm downloading 5 courses in parallel, I get a 502 once in every, say, 10,000 courses, but if I download 6-7 courses in parallel, I get a 502 perhaps 30% of the time. Let me check if NodeJS supports persistent connections 🙂 Perhaps the issue is that I'm constantly closing and reopening connections.

              7shi That's pretty awesome they intentionally waited for that number, haha XD

              It looks like enabling keep-alive fixed my 502 issue! ^_^ Thanks very much for pointing me in the right direction 🙂 This should make the Memrise servers a lot happier, too, I think, since they won't have to keep accepting and closing connections 🙂
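The thread doesn't show how keep-alive was switched on. One way to do it in Node.js, assuming the requests go through the built-in `https` module, is a shared `https.Agent` with `keepAlive: true`. The helper names `createKeepAliveAgent` and `getText` below are hypothetical.

```javascript
const https = require("https");

// Hypothetical helper: one shared agent that reuses sockets between requests.
function createKeepAliveAgent(maxParallel = 5) {
  return new https.Agent({
    keepAlive: true,         // keep sockets open for later requests
    maxSockets: maxParallel, // cap simultaneous connections per host
  });
}

// Hypothetical helper: GET `url` through `agent` and resolve with status and body.
function getText(url, agent) {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent }, (res) => {
        let body = "";
        res.setEncoding("utf8");
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve({ status: res.statusCode, body }));
      })
      .on("error", reject);
  });
}

// Usage sketch (not executed here):
// const agent = createKeepAliveAgent(5);
// const page = await getText("https://app.memrise.com/community/course/<id>/course-name/", agent);
```

With a single agent passed to every request, connections are reused rather than reopened for each course, and `maxSockets` doubles as a limit on parallel connections to the same host.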

              May I make a suggestion - I am a little concerned that incomplete and unlisted courses may be of no value (and the author who has set it as private, might not want it copied).

              Basically with so many courses I hope people will be able to find the best (or most popular) courses.

              Perhaps you could make it clear or even allow a filter that excludes incomplete and private courses.

              I went looking for mine and without a function to search by a key word or author or text in the description field (I add "looked after by DW7" to make it easy for me and others to find courses and know that support is available) it will be hard for me to find them.

              DW7

              At present many are still missing.

              This » one « should be under Literature.

              And these » ones « should be under "Religion - Christianity".

              Hi @[deleted], could you let me know if you have been able to upload these courses - I still can't find them (along with several Art courses I look after).


              I noticed that a lot of the courses I created were not showing up in the Memrise search for DW7, so I have added "Produced by DW7" to those courses. A fresh » search « should now list them all, some 275.


              This one is also missing: 7 Wonders ~ Ancient, Modern & Nature - by simone.zanotti - Memrise (under "Places", not to be confused with the *100 version)

              At the request of some people on Reddit, I've added most Kurdish courses and some Japanese courses 🙂

              @DW7 I don't think those courses have been uploaded yet, but once they are, they should automatically show up in your dashboard, so you won't need to search for them 🙂 However, adding a search feature is definitely still on the to-do list! 🙂

              Ya, once I found out that most of the courses that weren't listed in the categories were intentionally unlisted, I stopped planning to automatically add those courses to My Little Word Land, haha. God willing, I'm probably going to add just the unlisted courses created by Memrise, since people have been requesting those, and then keep copies of the other courses somewhere, in case someone asks for them in the future 🙂


                Good morning @[deleted],

                once they are, they should automatically show up in your dashboard,

                That's a new feature - brilliant - fantastic idea.

                Incidentally I looked at the *100 course and you offer three options for testing whereas I had four (ie two plus reversed).

                Location vs Name of image (eg Great Wall)
                Name of image vs Photo

                Then reversed.

                Also possible Name vs Location and reversed.

                (I have an image, but the image button says Link! PS: drag and drop worked, though.)
