after a slight digression, nailing down the code

alex 29th September 2024 at 3:36pm

It has been a while since the last post on Lute. But this has always been on my mind (although I did find a tool very similar to what I am trying to achieve; maybe I'll expand on that later).

In any case — here's the news:

Lute term import into a database is already handled

I might have overengineered some things, but there are three new objects. LuteEntry is the base class. NormalizedLuteEntry extends it and adds some methods for normalizing the data (things like lowercasing all terms, removing useless tags, deciding whether a given term is a parent or not, and so on)

(and this is where I got a tiny bit frustrated with my work: VocabSieve does this out of the box, so I took a few days off to decide whether I would rather use that tool or finish the work here. So there's that.)

And then there is LuteTableEntry, which corresponds to a SQL schema: the middleman between my Anki database and the Lute terms file.
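
Roughly, those two look something like the sketch below. This is only an illustration with stand-in field names, not the exact ones in the code; NormalizedLuteEntry shows up in full a little further down:

from dataclasses import dataclass
from typing import Optional


@dataclass
class LuteEntry:
    # a term as it comes out of the Lute export (illustrative subset of fields)
    term: str
    parent: str
    translation: str
    language: str
    tags: str
    status: str


@dataclass
class LuteTableEntry(LuteEntry):
    # the SQL-facing version: same data plus some bookkeeping,
    # e.g. the Anki card it ended up matched to, if any
    id: Optional[int] = None
    anki_card_id: Optional[int] = None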

As for normalisation, some things are already happening

Whenever Lute terms are imported, they must undergo some slight changes so as to bring them closer to my own Anki information — this will also make some later data processing, if it ever comes to that, a little easier.

A first run through all the new imports will signal whether there are any problems that need fixing:

from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class NormalizedLuteEntry(LuteEntry):
    # flags raised during the first pass over an imported term
    must_get_part_of_speech: bool = False
    must_get_gender: bool = False
    must_get_parent: bool = False
    must_clean_ion_tag: bool = False
    # every change (or outstanding problem) gets recorded here
    normalization_log: List[Dict[str, Any]] = field(default_factory=list)

    def log_change(self, method: str, field: str, original: str, normalized: str, fixed: bool = True):
        self.normalization_log.append({
            "method": method,
            "field": field,
            "original": original,
            "normalized": normalized,
            "fixed": fixed
        })

    [...]
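
The elided methods are the first pass itself. I won't paste them all, but the gist is roughly along these lines; a simplified, hypothetical sketch, where the tag names and heuristics are placeholders rather than the real ones:

USELESS_TAGS = {"placeholder-tag"}   # stand-in for whatever tags count as noise
KNOWN_PARTS_OF_SPEECH = {"noun", "verb", "adjective", "adverb"}   # illustrative


def first_pass_normalize(entry: NormalizedLuteEntry) -> None:
    # lowercase the term, logging the change
    lowered = entry.term.lower()
    if lowered != entry.term:
        entry.log_change("lowercase_term", "term", entry.term, lowered)
        entry.term = lowered

    # drop useless tags
    tags = [t.strip() for t in entry.tags.split(",") if t.strip()]
    kept = [t for t in tags if t not in USELESS_TAGS]
    cleaned = ", ".join(kept)
    if cleaned != entry.tags:
        entry.log_change("remove_useless_tags", "tags", entry.tags, cleaned)
        entry.tags = cleaned

    # no part of speech among the tags? flag it, so fix_logged_problems
    # knows to go and ask Wiktionary later
    if not any(t in KNOWN_PARTS_OF_SPEECH for t in kept):
        entry.must_get_part_of_speech = True
        entry.log_change("check_part_of_speech", "must_get_part_of_speech",
                         entry.tags, "", fixed=False)

    # no parent set? flag it, so we can decide later whether this term
    # is itself a parent or needs one looked up
    if not entry.parent:
        entry.must_get_parent = True
        entry.log_change("check_parent", "must_get_parent", entry.parent, "", fixed=False)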

Overengineered? Maybe. But it was sort of fun.

The most important fix is getting the part of speech: whether a word is a noun, a verb, an adjective, and so on. This matters for any further customisation of the flashcards later on.

    def fix_logged_problems(self):
        if self.must_get_part_of_speech:
            # TODO later this will be shielded by an API call
            categories = get_word_definition(self.term, "Danish")
            if categories:
                self.tags += ", ".join(list(map(lambda cat: cat["type"],
                                                categories)))
                # TODO 'conjugation' should be removed in this case
                # TODO create parent entry if there is none
                self.parent += ", ".join(list(map(lambda cat: cat["parent"],
                                                filter(lambda cat: 'parent' in cat,
                                                       categories))))
                part_of_speech_log = next(filter(lambda log: log["field"] == "must_get_part_of_speech",
                                            self.normalization_log))
                # take a proper copy so the original log entry is not mutated in place
                new_log = dict(part_of_speech_log)
                new_log["fixed"] = True
                self.normalization_log.remove(part_of_speech_log)
                self.normalization_log.append(new_log)

At this point, I had to merge in some work on the Wiktionary querying, which was almost done and ready to commit to the master branch. Whenever there is no information about a given word, Wiktionary is queried and the term is automatically assigned a category (or several).
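
For reference, a bare-bones version of that query could look like the sketch below. I'm assuming Wiktionary's REST definitions endpoint here, and the returned dicts are shaped the way fix_logged_problems expects them; the real get_word_definition does more work (it also digs out parents):

import requests

LANGUAGE_CODES = {"Danish": "da"}   # only what I need for now


def get_word_definition(term: str, language: str) -> list:
    """Sketch: fetch definitions for `term` from en.wiktionary and keep
    the entries for the requested language, as {"type": ...} dicts."""
    url = f"https://en.wiktionary.org/api/rest_v1/page/definitions/{term}"
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return []
    data = response.json()
    categories = []
    for usage in data.get(LANGUAGE_CODES[language], []):
        # each usage carries a part of speech such as "Noun" or "Verb"
        categories.append({"type": usage["partOfSpeech"].lower()})
        # the real code also extracts a parent form (e.g. the infinitive of a
        # conjugated verb) into a "parent" key; omitted here
    return categories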

After normalising the term and fixing whatever problems it had, one can think about matching it to an existing flashcard, if there is one. Luckily, AnkiConnect is great at bridging to Anki's own query mechanisms. All I'm doing is checking for an exact match on two fields (which is highly unlikely), then for an exact match on just one of them (in which case a word in Danish might match but carry two different, if almost synonymous, translations); the remaining cases are looser matches. If there is a match, ideally a unique one, the Anki card ID is associated with the LuteEntry.
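
Concretely, that is just a cascade of searches sent to AnkiConnect's local HTTP API, something like the sketch below. The field names Front and Back are stand-ins for whatever the note type actually calls them, and the queries in the real code differ:

import requests

ANKI_CONNECT_URL = "http://localhost:8765"   # AnkiConnect's default address


def anki_request(action: str, **params):
    """Send one AnkiConnect request and return its result (or raise on error)."""
    payload = {"action": action, "version": 6, "params": params}
    reply = requests.post(ANKI_CONNECT_URL, json=payload, timeout=10).json()
    if reply.get("error"):
        raise RuntimeError(reply["error"])
    return reply["result"]


def find_matching_card(term: str, translation: str):
    """Try increasingly loose searches; accept a match only if it is unique."""
    queries = [
        f'Front:"{term}" Back:"{translation}"',   # exact match on both fields (unlikely)
        f'Front:"{term}"',                        # exact match on the term alone
        f'Front:*{term}*',                        # looser: the term appears somewhere in the field
    ]
    for query in queries:
        card_ids = anki_request("findCards", query=query)
        if len(card_ids) == 1:
            return card_ids[0]
    return None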

I think there are still a few problems here; but as long as I keep in mind that I'm also learning and trying to improve my workflow, my decision making, and so on, this whole ordeal is slightly more appealing.