ensuring robustness of routine, worrying about data integrity, and no internet

alex 2nd October 2024 at 9:15am

As the code stands, I am close to having a routine that takes terms from Lute, normalises them, and matches each to a flashcard when possible. I would now like to nail this process down and make it as robust as it can be: to handle errors gracefully, and not to depend on the order in which actions happen or on other factors outside the step itself. I was first exposed to this idea in Grokking Simplicity, and it makes for good coding practice. There will be a lot of back and forth between the .csv and LuteTerm, API calls, and changes to the data, so data integrity is important.

Today, this practice feels even more pertinent: my internet access is limited (so word definitions cannot be queried online), but it should still be possible to set some flags for term normalisation and, more importantly, to communicate with the database and make definite changes even if a term doesn't undergo total normalisation (for instance, when its definition cannot be set).
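A minimal sketch of what such flags could look like, assuming a hypothetical `TermRecord` class and an optional `lookup` callable standing in for the online dictionary (neither name comes from Lute or from my actual code). The point is only that a term can be cleaned and saved while its definition stays marked as pending.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TermRecord:
    # Hypothetical record; "pending" lists normalisation steps still to run.
    text: str
    definition: Optional[str] = None
    pending: set = field(default_factory=set)


def normalise(term: TermRecord,
              lookup: Optional[Callable[[str], str]] = None) -> TermRecord:
    """Clean the term text, then try to fetch a definition if a lookup exists."""
    term.text = term.text.strip().casefold()
    definition = None
    if lookup is not None:
        try:
            definition = lookup(term.text)
        except OSError:
            # ConnectionError and TimeoutError are both subclasses of OSError.
            definition = None
    if definition:
        term.definition = definition
        term.pending.discard("definition")
    else:
        # Offline or failed lookup: keep the term, flag it for later.
        term.pending.add("definition")
    return term


def failing_lookup(text: str) -> str:
    raise ConnectionError("no internet today")


offline = normalise(TermRecord("  Blødt "))                # no lookup at all
broken = normalise(TermRecord("Bløde"), failing_lookup)    # lookup fails
online = normalise(TermRecord("blød"), lambda text: "soft (adj.)")
```

Calling `normalise` again on a flagged record once the lookup works clears the flag, so the step can be repeated without caring what ran before it.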

As I connect some pieces, I start noticing code smells here and there. This is my first go in a long time at object-oriented programming for information handling at a considerable scale; I might be rusty, and standing on poor foundations. The problem at hand is twofold: I want not only to avoid repeated database entries, but also to ensure a good dialogue back and forth between my database and Lute's own.

For the repeated database entries, I thought Lute's added timestamp would be good enough; it turns out it is not a unique field. Notice how the following three terms, all sharing one timestamp, are defined:

blød,,soft (adj.),Danish,adjective,2024-09-28 20:43:40,1,,
bløde,,soft (adj.),Danish,adjective,2024-09-28 20:43:40,1,,
blødt,"blød, bløde",soft (adj.),Danish,adjective,2024-09-28 20:43:40,1,,

In this case, blødt appeared in a text, and I manually gave it two parent entries, blød and bløde. (Now I wonder whether bløde is not itself a declension of blød. In any case, a word can have multiple parents: a term can be, say, both a verb conjugation and an adjective, even a noun too, like passado in Portuguese.) Anyway: the timestamp alone won't suffice, so a more convoluted process is needed, one that also looks at parent information. And when I cannot fetch more information from my APIs, it should still be possible to insert the term into the database, marked as needing further work.
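A rough sketch of the problem and one possible way around it, using the three rows quoted above. The column names in `FIELDS` are my own guess at what each position holds, and the `(language, term)` key is a hypothetical choice of identity, not something Lute prescribes.

```python
import csv
import io

# Assumed column names for the export; only the positions come from the rows.
FIELDS = ["term", "parents", "translation", "language", "tags",
          "added", "status", "link_status", "pronunciation"]

RAW = '''blød,,soft (adj.),Danish,adjective,2024-09-28 20:43:40,1,,
bløde,,soft (adj.),Danish,adjective,2024-09-28 20:43:40,1,,
blødt,"blød, bløde",soft (adj.),Danish,adjective,2024-09-28 20:43:40,1,,
'''


def read_rows(text):
    """Parse the export and split the parents column into a list."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), fieldnames=FIELDS):
        row["parents"] = [p.strip() for p in row["parents"].split(",") if p.strip()]
        rows.append(row)
    return rows


def term_key(row):
    # Hypothetical identity: language plus case-folded term text.
    return (row["language"], row["term"].casefold())


def dedupe(rows):
    """Keep the first row seen for each key, in input order."""
    seen = {}
    for row in rows:
        seen.setdefault(term_key(row), row)
    return list(seen.values())


rows = read_rows(RAW)
timestamps = {row["added"] for row in rows}   # a single value for three terms
unique_rows = dedupe(rows + rows)             # simulate importing the file twice
```

Here the three terms collapse to one timestamp, while the composite key still tells them apart and survives a repeated import.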


It has been a rather fun day: the information flows from the Lute .csv into my database, and I'm making up criteria for matching against my Anki database as I go. It feels like setting up consecutive funnels, each taking about 80% of what reaches it, while the cases left over keep getting more specific.
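The funnel idea can be sketched roughly as follows. The two matching stages and the `cards` dictionary (front text mapped to a card id) are hypothetical stand-ins; the real criteria against Anki are more involved.

```python
def match_exact(term, cards):
    """Stage 1: the term is exactly the front of a card."""
    return cards.get(term)


def match_casefold(term, cards):
    """Stage 2: same comparison, ignoring case."""
    folded = {front.casefold(): card for front, card in cards.items()}
    return folded.get(term.casefold())


def run_funnel(terms, cards, stages):
    """Pass terms through each stage in turn; only unmatched terms move on."""
    matched = {}
    remaining = list(terms)
    for stage in stages:
        still_unmatched = []
        for term in remaining:
            card = stage(term, cards)
            if card is None:
                still_unmatched.append(term)
            else:
                # Record which stage caught the term, for later inspection.
                matched[term] = (stage.__name__, card)
        remaining = still_unmatched
    return matched, remaining


cards = {"blød": 101, "Hus": 102}   # hypothetical front text -> card id
matched, unmatched = run_funnel(
    ["blød", "hus", "passado"], cards, [match_exact, match_casefold]
)
```

Whatever is still in `unmatched` at the end is the residue that needs a new, more specific stage or manual attention.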

There are a lot of intermediate steps — I have built way too much functionality into this... — but, again — rather fun day.