molybdenum-99/infoboxer

Parsing.md

Parsing Wikipedia is not an easy task. Some tags and formatting signs
can appear only after a newline, some can appear anywhere in the text; some
formatting can span several lines, some is force-closed at the end of the
line; there can be tons and tons of markup inside image captions, templates
and `<ref>`'s, so... Here's what I've come up with:

1. The entire page text is split into lines (after removing `<!-- -->`
  comments: they go nowhere).
2. First, we are in *paragraph* context. We look at the next line in the
  list of lines and guess what it is: a list item, a heading and so on.
3. Then, we are in *inline* context for the text of the paragraph (unless it
  is a table, which is a different story, or a heading, which is also
  different, or of course preformatted text... you get the idea). We scan
  the text until *any* inline formatting is met (or the end of the line).
4. When some formatting is met, we push the current context and scan inside
  it. The inline scanning is tricky!
  * Simple formatting like `''` (italic) is implicitly closed at the end
    of the line (this is called "short inline scan" inside Infoboxer's parser).
  * Long formatting like templates can span several lines, so we continue
    scanning through the next lines until the template ends (which means we
    are still in the same paragraph!). This is "normal inline scan", or just
    "inline scan".
  * Some __inline__ formatting (like `<ref>`'s) and special formatting,
    like table cells, can have other paragraphs inside! (But it's still
    "inline" formatting, because when the `<ref>` ends, the same paragraph
    continues: when Wikipedia displays it, the ref leaves a small
    footnote mark in the paragraph, and the contents go below.) We call
    such cases "long inline scan".
5. So, the parser tries to do everything in one forward scan, without
  returning to previous positions or tricks like "scan all symbols till the
  end of the template, then parse them as a separate sub-document" (the latter
  is the simplest way to parse MediaWiki markup; that's how Infoboxer worked
  at first; it was not very fast and not memory-efficient at all).
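
To make the steps above a bit more concrete, here is a minimal sketch of steps 1, 2 and the "short inline scan" from step 4. It is an illustration only, not Infoboxer's actual code: all method and constant names (`page_lines`, `guess_block`, `short_inline_scan`, `TEXT`, `MARKUP`, `STYLES`) are hypothetical, the block-type rules are a simplified subset of MediaWiki markup, and only `''` (italic) and `'''` (bold) are handled.

```ruby
require 'strscan'

# NOTE: every name below is hypothetical; this is not Infoboxer's API.

# Step 1: remove <!-- --> comments (they may span lines), then split into lines.
def page_lines(text)
  text.gsub(/<!--.*?-->/m, '').split("\n")
end

# Step 2: paragraph context. Guess the block type from the start of the line.
# Simplified assumption: only a handful of MediaWiki block markers are checked.
def guess_block(line)
  case line
  when /\A\s*\z/  then :empty
  when /\A==+/    then :heading
  when /\A[*#:;]/ then :list
  when /\A\{\|/   then :table
  when /\A /      then :pre
  else                 :paragraph
  end
end

TEXT   = /(?:[^']|'(?!'))+/ # plain text, lone apostrophes included
MARKUP = /'''|''/           # ''' is tried first, so bold wins over italic
STYLES = { "'''" => :bold, "''" => :italic }.freeze

# Step 4, short inline scan: one forward pass over a single line.
# Each opened marker "pushes a context" via recursion; reaching the end of
# the line implicitly closes whatever is still open.
# Limitation: assumes markers are properly nested.
def short_inline_scan(scanner, closer = nil)
  nodes = []
  until scanner.eos?
    if (text = scanner.scan(TEXT))
      nodes << text
    else
      mark = scanner.scan(MARKUP)
      return nodes if mark == closer # our own closing marker
      nodes << [STYLES[mark], short_inline_scan(scanner, mark)]
    end
  end
  nodes
end

page_lines("== Title ==\nSome ''text''<!-- hidden -->").each do |line|
  p [guess_block(line), short_inline_scan(StringScanner.new(line))]
end
```

Note that the scanner never moves backwards: the nested call simply continues from the position where the opening marker was consumed, which is the "one forward scan" idea from step 5. The multi-line ("normal" and "long") scans would additionally need to pull the next lines from the list while a context is still open; that part is left out of the sketch.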