This page describes criteria for creating an evaluation corpus for a document- and ontology-based information system.

Corpus Information

Possible domains


These are the basic requirements:

  • document corpus
    • single domain
    • different lengths (pages)
    • different types (news ticker, article, book, website)
    • at least 100 documents
    • different creation dates (time-aware)
  • thesaurus
    • may link to WordNet
    • synonyms
    • acronyms
  • ontology
    • single namespace
    • annotations (synsets, ...)
  • domain ontology
    • describes the domain of the document corpus
    • contains taxonomy of classes
    • contains taxonomy of possible relations between classes
    • inverse relations are needed
    • OWL as language
    • named graphs as technique (reification)
    • allows creation of complex but self-explanatory queries
  • instance base
    • contains annotations of document corpus
    • high density of relations between instances
    • high and uniform coverage of classes and relations
    • each document is an instance
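
The document corpus requirements listed above could be checked mechanically once the documents carry some metadata. The following is a minimal sketch, not a prescribed tool: the `Document` record, its field names, and the `check_corpus` function are hypothetical and chosen only for illustration. "Different" is interpreted loosely as "at least two distinct values", which is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    # Hypothetical metadata record; field names are illustrative only.
    doc_id: str
    doc_type: str   # e.g. "news ticker", "article", "book", "website"
    pages: int      # document length in pages
    created: date   # creation date
    domain: str     # domain label of the document


def check_corpus(docs, min_docs=100):
    """Return a list of violated requirements (empty list: all checks pass)."""
    problems = []
    if len(docs) < min_docs:
        problems.append(f"only {len(docs)} documents, need at least {min_docs}")
    if len({d.domain for d in docs}) > 1:
        problems.append("more than one domain")
    # Assumption: "different" means at least two distinct values.
    if len({d.doc_type for d in docs}) < 2:
        problems.append("no variety in document types")
    if len({d.pages for d in docs}) < 2:
        problems.append("no variety in document lengths")
    if len({d.created for d in docs}) < 2:
        problems.append("no variety in creation dates")
    return problems


# Usage with a synthetic corpus of 120 documents in one made-up domain.
docs = [
    Document(f"d{i}", "article" if i % 2 else "book",
             1 + i % 7, date(2010, 1, 1 + i % 28), "example-domain")
    for i in range(120)
]
print(check_corpus(docs))
```

A stricter check, for example a minimum number of documents per type, would fit the same structure.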

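The requirement of high and uniform coverage of classes in the instance base can be made measurable in several ways. The sketch below shows one possible reading and is an assumption, not a definition from this page: `covered` is the share of ontology classes that have at least one instance, and `evenness` is the ratio of the least to the most populated covered class. The function name `class_coverage` and both measures are hypothetical.

```python
from collections import Counter


def class_coverage(instance_classes, ontology_classes):
    """Summarize how instances are spread over the ontology's classes.

    instance_classes: iterable with one class name per instance
    ontology_classes: collection of all class names in the domain ontology
    """
    # Count instances per class, ignoring classes unknown to the ontology.
    counts = Counter(c for c in instance_classes if c in ontology_classes)
    used = [counts[c] for c in ontology_classes if counts[c] > 0]
    # Share of ontology classes that have at least one instance.
    covered = len(used) / len(ontology_classes) if ontology_classes else 0.0
    # Least vs. most populated covered class; 1.0 means perfectly even.
    evenness = min(used) / max(used) if used else 0.0
    return {"covered": covered, "evenness": evenness}


# Usage with made-up class names: two of four classes are covered.
result = class_coverage(["A", "A", "B"], {"A", "B", "C", "D"})
print(result)
```

The same counting could be applied to relations by passing one relation name per statement instead of one class name per instance.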