[September 15th, 2024: this page lightly documents another averted exploration with regard to ankigarden's Wiktionary API handling. TL;DR: I tried the dedicated definition route from the v1 API, but it was rather limited; I moved to v2, but came to the conclusion that parsing the webpage HTML was a more robust solution.]
I re-read the v2 API documentation and it makes much more sense now. `action` is clearly `parse`.
base_url = "https://en.wiktionary.org/w/api.php"
params = {
    "action": "parse",
    "page": page_title,  # title of the entry, e.g. "cykel"
    "format": "json",
    "utf8": 1,
    "formatversion": 2,
}
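As a rough sketch, the parameters above could be turned into a request URL with the standard library alone. `build_parse_url` is a hypothetical helper (not part of ankigarden), and the optional `prop` argument is an assumption about how one might pass the API's `prop` parameter; no request is actually sent here.

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wiktionary.org/w/api.php"


def build_parse_url(page_title, prop=None):
    """Build an action=parse URL (hypothetical helper, for illustration only)."""
    params = {
        "action": "parse",
        "page": page_title,
        "format": "json",
        "utf8": 1,
        "formatversion": 2,
    }
    if prop is not None:
        # Restrict the response to one property, e.g. "sections".
        params["prop"] = prop
    return f"{BASE_URL}?{urlencode(params)}"


url = build_parse_url("cykel", prop="sections")
print(url)
```

The resulting URL can be handed to whichever HTTP client is already in use.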
With `prop: sections`, the response contains a list akin to a table of contents:
{'parse': {'pageid': 40981,
'sections': [{'anchor': 'Danish',
'byteoffset': 0,
'fromtitle': 'cykel',
'index': '1',
'level': '2',
'line': 'Danish',
'linkAnchor': 'Danish',
'number': '1',
'toclevel': 1},
{'anchor': 'Etymology',
'byteoffset': 12,
'fromtitle': 'cykel',
'index': '2',
'level': '3',
'line': 'Etymology',
'linkAnchor': 'Etymology',
'number': '1.1',
'toclevel': 2},
{'anchor': 'Pronunciation',
'byteoffset': 82,
'fromtitle': 'cykel',
'index': '3',
'level': '3',
'line': 'Pronunciation',
'linkAnchor': 'Pronunciation',
'number': '1.2',
'toclevel': 2},
{'anchor': 'Noun',
'byteoffset': 178,
'fromtitle': 'cykel',
'index': '4',
'level': '3',
'line': 'Noun',
'linkAnchor': 'Noun',
'number': '1.3',
'toclevel': 2},
{'anchor': 'Inflection',
'byteoffset': 286,
'fromtitle': 'cykel',
'index': '5',
'level': '4',
'line': 'Inflection',
'linkAnchor': 'Inflection',
'number': '1.3.1',
'toclevel': 3},
{'anchor': 'Derived_terms',
'byteoffset': 430,
'fromtitle': 'cykel',
'index': '6',
'level': '4',
'line': 'Derived terms',
'linkAnchor': 'Derived_terms',
'number': '1.3.2',
'toclevel': 3},
[...]
(Norwegian comes next.)
The first step is to identify the target language's entry (a language sits at `level: 2`) and retrieve its `number` (e.g. filter on `anchor="Danish"` and `level=2`). Note that `level` comes back as a string in the response.
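A minimal sketch of that filter, assuming `sections` is the list found under `parse.sections` in the response. `find_language_section` is a hypothetical name, and the sample data is a trimmed copy of the `cykel` output above.

```python
def find_language_section(sections, language):
    """Return the level-2 section whose anchor is `language`, or None."""
    for section in sections:
        # `level` is a string in the response; str() also tolerates ints.
        if section["anchor"] == language and str(section["level"]) == "2":
            return section
    return None


# Trimmed sample taken from the `cykel` response.
sections = [
    {"anchor": "Danish", "level": "2", "line": "Danish", "number": "1"},
    {"anchor": "Etymology", "level": "3", "line": "Etymology", "number": "1.1"},
    {"anchor": "Noun", "level": "3", "line": "Noun", "number": "1.3"},
]

danish = find_language_section(sections, "Danish")
print(danish["number"])  # prints 1
```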
Then, simple cases are straightforward; but for `bil`, for example,
{'anchor': 'Danish',
'byteoffset': 755,
'fromtitle': 'bil',
'index': '14',
'level': '2',
'line': 'Danish',
'linkAnchor': 'Danish',
'number': '4',
'toclevel': 1},
{'anchor': 'Etymology_3',
'byteoffset': 767,
'fromtitle': 'bil',
'index': '15',
'level': '3',
'line': 'Etymology',
'linkAnchor': 'Etymology_3',
'number': '4.1',
'toclevel': 2},
{'anchor': 'Pronunciation_3',
'byteoffset': 882,
'fromtitle': 'bil',
'index': '16',
'level': '3',
'line': 'Pronunciation',
'linkAnchor': 'Pronunciation_3',
'number': '4.2',
'toclevel': 2},
{'anchor': 'Noun_2',
'byteoffset': 956,
'fromtitle': 'bil',
'index': '17',
'level': '3',
'line': 'Noun',
'linkAnchor': 'Noun_2',
'number': '4.3',
'toclevel': 2},
{'anchor': 'Declension',
'byteoffset': 1013,
'fromtitle': 'bil',
'index': '18',
'level': '4',
'line': 'Declension',
'linkAnchor': 'Declension',
'number': '4.3.1',
'toclevel': 3},
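it is clear that one must then select the sections whose `number` starts with the parent's `number` and whose `level` is the parent's level + 1. That exposes the different definitions (it could be a noun, adjective, verb, etc.). To match a part of speech, one should look at `line`, not `anchor`: as shown above, anchors pick up a numeric suffix (`Noun_2`, `Etymology_3`) when the same heading already appears earlier on the page.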
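The matching described above could be sketched as follows. `direct_children` and `children_by_line` are hypothetical helpers, and the sample data is a trimmed copy of the `bil` output.

```python
def direct_children(sections, parent):
    """Sections one level below `parent` whose number extends the parent's."""
    prefix = parent["number"] + "."  # "4." matches "4.1" but not "40.1"
    child_level = int(parent["level"]) + 1
    return [
        s for s in sections
        if s["number"].startswith(prefix) and int(s["level"]) == child_level
    ]


def children_by_line(sections, parent, line):
    """Filter direct children on `line`, since anchors may carry a suffix."""
    return [s for s in direct_children(sections, parent) if s["line"] == line]


# Trimmed sample taken from the `bil` response.
sections = [
    {"anchor": "Danish", "level": "2", "line": "Danish", "number": "4"},
    {"anchor": "Etymology_3", "level": "3", "line": "Etymology", "number": "4.1"},
    {"anchor": "Pronunciation_3", "level": "3", "line": "Pronunciation", "number": "4.2"},
    {"anchor": "Noun_2", "level": "3", "line": "Noun", "number": "4.3"},
    {"anchor": "Declension", "level": "4", "line": "Declension", "number": "4.3.1"},
]

danish = sections[0]
nouns = children_by_line(sections, danish, "Noun")
print([s["anchor"] for s in nouns])  # prints ['Noun_2']
```

Filtering on `anchor == "Noun"` would return nothing here, which is the reason for comparing against `line`.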
This was a dead end because, as rightly pointed out in the Wiktionary API discussion, it is not worth running two requests (one for the section list, another for the section contents), especially given that I must parse the contents manually anyway.
Status | averted |
---|---|
Priority | high |