query word information from Wiktionary

alex 4th October 2024 at 7:16am

After two (one plus one) unsuccessful attempts at manoeuvring around Wiktionary's API, the third time was indeed the charm.

Unfortunately, I did not document much of my progress while coding, although I did follow test-driven development, which made the process much more bearable. The work so far amounts to two modules: apis/wiktionary.py and wiktionary-tests.py.

At first I only had retrieving word gender on the horizon, but I got comfortable enough with BeautifulSoup to extend the functionality to verbs and other (as yet largely untested) word roles. So far this has only been tested with Danish. Some examples of the breadth follow:

# a query of `bo`, which is both a noun and a verb:

[{'etymology': 'From Old Norse bú, from Old Norse búa (“to reside”).',
  'type': 'noun',
  'gender': 'n',
  'definition': [
      'estate (the property of a deceased person)',
      'den, nest',
      'abode, home']},
 {'etymology': 'From Old Norse búa (“to reside”), from Proto-Germanic *būaną, cognate with Norwegian bo, bu, Swedish bo, German bauen, Dutch bouwen, Gothic 𐌱𐌰𐌿𐌰𐌽 (bauan).',
  'type': 'verb',
  'conjugation': {
      'present tense': 'bor',
      'past tense': 'boede'
      },
  'definition': [
      'to live, reside, dwell']
  }]

# a query of the verb (at) `tale`

[{"etymology": "From Old Norse tala.",
  "type": "noun",
  "gender": "c",
  "definition": ["speech, talk, address, discourse"]},
 {"etymology": "From Old Norse tala.",
  "type": "verb",
  "conjugation": {
      "imperative": "tal",
      "infinitive": "at tale",
      "present tense": "taler",
      "past tense": "talte",
      "perfect tense": "har talt"
      },
  "definition": ["to make a speech", "to speak, talk"]
  }]

Most of the heavy lifting is done by the function get_word_categories_from_subsections, which recursively traverses the section of a word's Wiktionary page that belongs to the requested language.
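To illustrate the recursive idea, here is a minimal sketch. It is not the actual get_word_categories_from_subsections: the name collect_word_categories, the WORD_TYPES set and the nested-dict section tree are all assumptions made for the example, whereas the real module works on HTML parsed with BeautifulSoup.

```python
from typing import Any

# Hypothetical set of headings treated as word types (an assumption).
WORD_TYPES = {"noun", "verb", "adjective", "adverb"}


def collect_word_categories(section: dict[str, Any]) -> list[dict[str, Any]]:
    """Walk a section tree depth-first; return one entry per word-type heading."""
    found = []
    heading = section.get("heading", "").lower()
    if heading in WORD_TYPES:
        found.append({"type": heading,
                      "definition": section.get("definitions", [])})
    # Recurse into nested subsections, however deep they go.
    for sub in section.get("subsections", []):
        found.extend(collect_word_categories(sub))
    return found


# Hypothetical, hand-written stand-in for the Danish section of `bo`.
danish_bo = {
    "heading": "Danish",
    "subsections": [
        {"heading": "Etymology 1", "subsections": [
            {"heading": "Noun",
             "definitions": ["estate (the property of a deceased person)",
                             "den, nest",
                             "abode, home"]},
        ]},
        {"heading": "Etymology 2", "subsections": [
            {"heading": "Verb",
             "definitions": ["to live, reside, dwell"]},
        ]},
    ],
}

entries = collect_word_categories(danish_bo)
print([entry["type"] for entry in entries])
```

The recursion means the function does not need to know in advance how deeply a word-type heading is nested under the language heading.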

For now, the definition field is what I value most, and it serves my purposes: DeepL, the source of all my translations, is better at translating sentences than at defining single words. I will therefore halt further work on Wiktionary parsing and redirect my efforts.
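As a rough sketch of how the definition field could be pulled out of a result, the snippet below groups definitions by word type. The helper name definitions_by_type is hypothetical and not part of the module; the input simply mirrors the `tale` output shown earlier.

```python
def definitions_by_type(entries):
    """Group the 'definition' lists of a query result by their 'type' field."""
    grouped = {}
    for entry in entries:
        # Entries without a 'definition' key contribute nothing.
        grouped.setdefault(entry["type"], []).extend(entry.get("definition", []))
    return grouped


# Same shape as the `tale` result above, trimmed to the relevant fields.
tale = [
    {"etymology": "From Old Norse tala.",
     "type": "noun",
     "gender": "c",
     "definition": ["speech, talk, address, discourse"]},
    {"etymology": "From Old Norse tala.",
     "type": "verb",
     "definition": ["to make a speech", "to speak, talk"]},
]

grouped = definitions_by_type(tale)
print(grouped["verb"])
```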


This time I was aware of other similar endeavours; here's one, which was recently archived and doesn't seem to work properly. I could not tell whether someone had forked the code to continue it elsewhere.


I'll consider this done, as most of the functionality is implemented and only some rough edges remain to be smoothed; that should happen naturally as I keep working on the big picture.

Status: done
Priority: high