query word definitions from Wiktionary via their version 1 API

alex 15th September 2024 at 8:42am

[Note: this was written in the 29th of August, 2024; also, this uses an older route of the Wiktionary API — v1 — and I won't pursue it further, as it ommits gender information from nouns, among other things I need; oh — and at some point I thought this could be a good primer on test-driven development; now I am not so sure. So be wary — this is a dead-end!]

Stating the problem

I am working on retrieving some information from Wiktionary's API. The following is the response I get when querying page/definitions/trivsel (where trivsel is the work in Danish), and I want to get all possible definitions, and even maybe use the partOfSpeech field, as that can be handy too.

API_RESPONSE = [
{'definitions': [{'definition': '<a rel="mw:WikiLink" href="/wiki/well-being" '
                                 'title="well-being">well-being</a>'},
                  {'definition': '<a rel="mw:WikiLink" href="/wiki/growth" '
                                 'title="growth">growth</a>, <a '
                                 'rel="mw:WikiLink" href="/wiki/prosperity" '
                                 'title="prosperity">prosperity</a>'}],
  'language': 'Danish',
  'partOfSpeech': 'Noun'}]

(and, for the sake of completeness, here is the full information on the Wiktionary; notice how the elusive word gender information is present, but it's not retrieved by the API, unfortunately:)

Devising the first test

With TDD, I am forced to think ahead: how exactly do I want this information to be presented? Maybe something like this:

[{"noun": ["well-being", "growth, prosperity"]}]

and, thus, I could imagine a very handy function (that doesn't exist yet), whose name could be, huh — parse_term_definition; and it would receive an API response as the aforementioned, to yield something like

{
	"query": "trivsel",
  "definitions": [{"noun": ["well-being", "growth, prosperity"]}]
}

Well, then! Our first test:

    def test_get_danish_term_definition(self):
        # NOTE an alternative to hard-coding the API_RESPONSE would be
				# using a request-cache, so as not to overload the API or,
				# (arguably worse) depend on an internet connection for development
				API_RESPONSE = [...]

        EXPECTED = {
                "query": "trivsel",
                "definitions": [{"noun": ["well-being", "growth, prosperity"]}]
                }
        self.assertEqual(parse_term_definition(API_RESPONSE, "da"), EXPECTED)

So: why is this useful? First, thinking ahead is good; it helps put aside the worry of deciding where to go: the goal is now very clear. And second, it (ideally) encapsulates an environment for repeated tries at a piece of code, and thus it shortens the feedback loop between a code change and watching its effects.

At this point, since parse_term_definition doesn't exist, the next step is to define it in the quickest, hackiest possible way so that it passes the test (some would argue that hardcoding the answer is a viable next step — and this might make sense in more convoluted environments, where there are big interdependecies among functions; but it's certainly not our case).

A possible next step, though, is to further delay this definition, and think about how to further delegate computational work, so to say, to even smaller functions (this is not necessarily TDD, but rather striving to have atomic functions that do only one simple task). The, one could break the parse_term_definition even further: there's some HTML in need of parsing. I can see that both Greek and Danish responses follow the same structure:

# from Greek's που
[{'definitions': [{'definition': '<a rel="mw:WikiLink" href="/wiki/that" '
                                 'title="that">that</a>', [...]
# from Danish's trivsel
[{'definition': '<a rel="mw:WikiLink" href="/wiki/well-being" '
                                 'title="well-being">well-being</a>'},
								'<a rel="mw:WikiLink" href="/wiki/growth" '
                                 'title="growth">growth</a>, <a '
                   'rel="mw:WikiLink" href="/wiki/prosperity" '
                                 'title="prosperity">prosperity</a>'}]

and so the framework for an inner function, parse_html_from_definition, could be devised via tests, too:

    def test_parse_html_from_definition(self):
        DANISH_DEFINITION = '<a rel="mw:WikiLink"
                                href="/wiki/well-being" 
                                title="well-being">well-being</a>'
        EXPECTED = "well-being"
        self.assertEqual(parse_html_from_definition(DANISH_DEFINITION, EXPECTED))

Ensuring test coverage

...but at this point a good question could arise: how to ensure (full) correctness of the code? Well, that's very difficult. It is much easier to prove something is wrong (via a counter-example) than otherwise (and mathematics is very much concerned with things like this). On the other hand, there's something very well within our reach — that would be to expand the tested range of words as much as possible, trying to amount for different (in the, uh, abstract sense) queries.

Notice, for example, how the definition for trivsel can be trickier than the Greek word; it has in fact \\two entries\\, instead of one:

{'definitions': [{'definition': '<a rel="mw:WikiLink" href="/wiki/well-being" '
                                 'title="well-being">well-being</a>'},
                 {'definition': '<a rel="mw:WikiLink" href="/wiki/growth" '
                                 'title="growth">growth</a>, <a '
                                 'rel="mw:WikiLink" href="/wiki/prosperity" '
                                 'title="prosperity">prosperity</a>'}],

Then our tests should include not only cases for one entry in definition, but also two — and ideally three, if that ever happens, etc.


I'll end my post here. This is by no means a guide on test-driven development (oh my!) — just a quick account on how I would structure an approach to a problem, that came about when I was, in fact, trying to solve one.

Statusdone
Priorityhigh