query word from Wiktionary API using section param

alex 15th September 2024 at 9:03am

[September 15th, 2024: this page briefly documents another averted exploration of ankigarden's Wiktionary API handling. TL;DR: I first tried the dedicated definition route of the v1 API, but it was rather limited; I then moved to v2, but concluded that parsing the page HTML was the more robust solution.]

I re-read the v2 API documentation and it makes much more sense now. The action to use is clearly parse.

base_url = "https://en.wiktionary.org/w/api.php"
params = {
    "action": "parse",
    "page": page_title,  # the word to look up, e.g. "cykel"
    "format": "json",
    "utf8": 1,
    "formatversion": 2,
}
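As a rough sketch, these params can be turned into a request URL with the standard library alone. The helper name build_parse_url is hypothetical, and passing prop explicitly is an assumption of this sketch (the params above leave it out); no request is actually sent here.

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wiktionary.org/w/api.php"


def build_parse_url(page_title, prop="sections"):
    """Build the action=parse URL for a page (hypothetical helper)."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": prop,  # assumption: ask only for the section list
        "format": "json",
        "utf8": 1,
        "formatversion": 2,
    }
    # urlencode keeps the dict's insertion order and escapes the values.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_parse_url("cykel"))
```

The resulting URL can be handed to any HTTP client; the JSON body is then read from the parse key.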

With prop set to sections, the response contains a list akin to a table of contents:

{'parse': {'pageid': 40981,
           'sections': [{'anchor': 'Danish',
                         'byteoffset': 0,
                         'fromtitle': 'cykel',
                         'index': '1',
                         'level': '2',
                         'line': 'Danish',
                         'linkAnchor': 'Danish',
                         'number': '1',
                         'toclevel': 1},
                        {'anchor': 'Etymology',
                         'byteoffset': 12,
                         'fromtitle': 'cykel',
                         'index': '2',
                         'level': '3',
                         'line': 'Etymology',
                         'linkAnchor': 'Etymology',
                         'number': '1.1',
                         'toclevel': 2},
                        {'anchor': 'Pronunciation',
                         'byteoffset': 82,
                         'fromtitle': 'cykel',
                         'index': '3',
                         'level': '3',
                         'line': 'Pronunciation',
                         'linkAnchor': 'Pronunciation',
                         'number': '1.2',
                         'toclevel': 2},
                        {'anchor': 'Noun',
                         'byteoffset': 178,
                         'fromtitle': 'cykel',
                         'index': '4',
                         'level': '3',
                         'line': 'Noun',
                         'linkAnchor': 'Noun',
                         'number': '1.3',
                         'toclevel': 2},
                        {'anchor': 'Inflection',
                         'byteoffset': 286,
                         'fromtitle': 'cykel',
                         'index': '5',
                         'level': '4',
                         'line': 'Inflection',
                         'linkAnchor': 'Inflection',
                         'number': '1.3.1',
                         'toclevel': 3},
                        {'anchor': 'Derived_terms',
                         'byteoffset': 430,
                         'fromtitle': 'cykel',
                         'index': '6',
                         'level': '4',
                         'line': 'Derived terms',
                         'linkAnchor': 'Derived_terms',
                         'number': '1.3.2',
                         'toclevel': 3},
[...]

(Norwegian comes next.)

The first step is to find the target language's entry (a language sits at level 2) and retrieve its number (e.g. filter on anchor = "Danish" and level = "2"; note that the API returns level as a string).
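A minimal sketch of that lookup, assuming the sections list has already been extracted from the response. The function name find_language_section is hypothetical, and the sample data is a trimmed copy of the cykel output above.

```python
def find_language_section(sections, language):
    """Return the level-2 section for `language`, or None if absent."""
    for section in sections:
        # 'level' arrives as a string ('2'), so compare as strings.
        if section["anchor"] == language and str(section["level"]) == "2":
            return section
    return None


# Trimmed copy of the cykel sections shown above.
sections = [
    {"anchor": "Danish", "index": "1", "level": "2", "line": "Danish", "number": "1"},
    {"anchor": "Etymology", "index": "2", "level": "3", "line": "Etymology", "number": "1.1"},
    {"anchor": "Noun", "index": "4", "level": "3", "line": "Noun", "number": "1.3"},
]

danish = find_language_section(sections, "Danish")
print(danish["number"])  # the number later used to find child sections
```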

Then, simple cases are straightforward; but for bil, for example,

                        {'anchor': 'Danish',
                         'byteoffset': 755,
                         'fromtitle': 'bil',
                         'index': '14',
                         'level': '2',
                         'line': 'Danish',
                         'linkAnchor': 'Danish',
                         'number': '4',
                         'toclevel': 1},
                        {'anchor': 'Etymology_3',
                         'byteoffset': 767,
                         'fromtitle': 'bil',
                         'index': '15',
                         'level': '3',
                         'line': 'Etymology',
                         'linkAnchor': 'Etymology_3',
                         'number': '4.1',
                         'toclevel': 2},
                        {'anchor': 'Pronunciation_3',
                         'byteoffset': 882,
                         'fromtitle': 'bil',
                         'index': '16',
                         'level': '3',
                         'line': 'Pronunciation',
                         'linkAnchor': 'Pronunciation_3',
                         'number': '4.2',
                         'toclevel': 2},
                        {'anchor': 'Noun_2',
                         'byteoffset': 956,
                         'fromtitle': 'bil',
                         'index': '17',
                         'level': '3',
                         'line': 'Noun',
                         'linkAnchor': 'Noun_2',
                         'number': '4.3',
                         'toclevel': 2},
                        {'anchor': 'Declension',
                         'byteoffset': 1013,
                         'fromtitle': 'bil',
                         'index': '18',
                         'level': '4',
                         'line': 'Declension',
                         'linkAnchor': 'Declension',
                         'number': '4.3.1',
                         'toclevel': 3},

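it is clear that one must then select the sections whose number starts with the parent's number and whose level is the parent's level + 1. That exposes the different parts of speech (noun, adjective, verb, etc.), each holding its own definitions. To pick one, match on line rather than anchor: as shown above, anchor carries a page-wide disambiguating suffix (Noun_2, Etymology_3), while line stays plain (Noun, Etymology).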
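The filtering could be sketched as follows. The helper names child_sections and sections_by_line are hypothetical, and the sample data is a trimmed copy of the bil output above.

```python
def child_sections(sections, parent):
    """Return the direct children of `parent` in the section list."""
    prefix = parent["number"] + "."          # e.g. '4.' for Danish in bil
    child_level = int(parent["level"]) + 1   # levels arrive as strings
    return [
        s for s in sections
        if s["number"].startswith(prefix) and int(s["level"]) == child_level
    ]


def sections_by_line(sections, parent, line):
    """Pick children by heading text ('line'), not by the suffixed anchor."""
    return [s for s in child_sections(sections, parent) if s["line"] == line]


# Trimmed copy of the Danish part of the bil sections shown above.
bil_sections = [
    {"anchor": "Danish", "index": "14", "level": "2", "line": "Danish", "number": "4"},
    {"anchor": "Etymology_3", "index": "15", "level": "3", "line": "Etymology", "number": "4.1"},
    {"anchor": "Pronunciation_3", "index": "16", "level": "3", "line": "Pronunciation", "number": "4.2"},
    {"anchor": "Noun_2", "index": "17", "level": "3", "line": "Noun", "number": "4.3"},
    {"anchor": "Declension", "index": "18", "level": "4", "line": "Declension", "number": "4.3.1"},
]

danish = bil_sections[0]
nouns = sections_by_line(bil_sections, danish, "Noun")
print([s["anchor"] for s in nouns])
```

Matching on "Noun" finds the section even though its anchor is Noun_2, and Declension is skipped because it sits one level deeper.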


This was a dead end because, as rightly pointed out in the Wiktionary API discussion, it is not worth running two requests, especially given that I would have to parse the returned content manually anyway.

Status: averted
Priority: high