wiki:Evaluation/DocumentClassification

The motivation for this evaluation was to get an idea of how an iDocument user could automatically be provided with templates according to the document type. To this end, we took a machine learning approach, classifying our test corpus of documents with two probabilistic models: Naive Bayes and Maximum Entropy. The test corpus consisted of five classes: business cards, emails, (mostly scientific) publications, slides, and Wikipedia articles; each class contained between 101 and 225 documents.

For the evaluation, seven different types of features were extracted from the test documents and applied to the probabilistic models in different combinations. The following types of features were used:

  • Tokens: The individual tokens of the text, split using the regular expression "[^\w]".
  • Mime Type: The MIME type of the document, as returned by Aperture (note that Aperture occasionally could not determine the MIME type and returned null instead).
  • Titles: Only the tokens appearing in the documents' titles, using the same splitting rule as for the full text tokens (note that many documents did not have titles).
  • N-Grams: All character 3-grams of the individual tokens in the full text. For tokens of length <= 3, the full token was used.
  • Text length: The number of characters in the fulltext, divided logarithmically into categories: <=10, <=100, .., <=100000, and over 100000.
  • Non-contiguous substrings: All combinations of two characters that appear in this order in the fulltext and are at most 100 characters apart.
  • Number to letter ratio: A simple division into two categories, depending on whether the fulltext contains more numbers than letters.
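
As a rough sketch of how such extractors might look (all function names are hypothetical and not taken from the actual implementation; the length buckets follow the explicitly listed thresholds):

```python
import re

def token_features(text):
    """Split the text on non-word characters, as with the [^\\w] rule."""
    return [t for t in re.split(r"[^\w]", text) if t]

def ngram_features(text, n=3):
    """Character 3-grams per token; tokens of length <= n are kept whole."""
    feats = []
    for tok in token_features(text):
        if len(tok) <= n:
            feats.append(tok)
        else:
            feats.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return feats

def substring_pair_features(text, max_gap=100):
    """Ordered character pairs at most max_gap characters apart."""
    pairs = set()
    for i, a in enumerate(text):
        for b in text[i + 1:i + 1 + max_gap]:
            pairs.add(a + b)
    return sorted(pairs)

def length_bucket(text):
    """Logarithmic text-length category: <=10, <=100, ..., then 'over'."""
    n = len(text)
    for exp in range(1, 6):          # thresholds 10^1 .. 10^5
        if n <= 10 ** exp:
            return "len<=1e%d" % exp
    return "len>1e5"

def digit_ratio_feature(text):
    """Two categories: more digits than letters, or not."""
    digits = sum(c.isdigit() for c in text)
    letters = sum(c.isalpha() for c in text)
    return "more_digits" if digits > letters else "more_letters"
```

Each extractor returns nominal feature values that can be fed to the probabilistic models, individually or in combination.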

In order to accurately compare the two models and the different feature combinations, 4-fold cross-validation was used, with 75% of the test corpus used to train the models and 25% to test them. In total, the following measures were taken:

  • Overall Accuracy
  • Time needed to train the model
  • Time needed to perform predictions
  • Prediction accuracy using only 10%, 20%, .., 90% of the training data.
  • Confusion matrix
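
The cross-validation procedure can be sketched as follows (a minimal illustration, not the evaluation code; the model is passed in as a pair of hypothetical train/predict callables):

```python
import random

def four_fold_cv(documents, labels, train_fn, predict_fn, seed=0):
    """4-fold cross-validation: each fold trains on 75% of the
    corpus and tests on the remaining 25%; returns mean accuracy."""
    idx = list(range(len(documents)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::4] for i in range(4)]
    accuracies = []
    for k in range(4):
        test_idx = set(folds[k])
        train = [(documents[i], labels[i]) for i in idx if i not in test_idx]
        test = [(documents[i], labels[i]) for i in folds[k]]
        model = train_fn(train)
        correct = sum(predict_fn(model, doc) == lab for doc, lab in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / 4
```

Training time, prediction time, and the confusion matrix would be recorded inside the same loop; the learning-curve measurements repeat this with truncated training sets.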

The detailed evaluation results can be found here (TODO)

The evaluation led to the following conclusions:

  • Except for some unusable feature combinations, Maximum Entropy produced considerably better results, often making fewer than half as many errors.
  • Maximum Entropy needed roughly 500 times as long for training as Naive Bayes.
  • In contrast, Maximum Entropy did not take considerably longer for predictions than Naive Bayes.
  • The most promising individual types of features were those with rather high dimensionality: tokens, n-grams, and non-contiguous substrings.
  • The accuracy for the aforementioned feature types was so high that combining them with some of the other features did not bring benefits that would be worth the extra effort.
  • Maximum Entropy reached accuracy values of up to 99% (using non-contiguous substrings).
  • Naive Bayes reached 94% using tokens and 96% combining all types of features.
  • For the learning curve, it is worth mentioning that 10% of the training set (~70 documents) already produced usable results (e.g. 95% for Maximum Entropy / non-contiguous substrings).
  • Generally speaking, the learning curve reached its peak at between 70% and 100%.
  • Because of its high accuracy (~99%) and good prediction time, Maximum Entropy with the non-contiguous substrings feature can be recommended for the intended purpose. However, training takes rather long (~40 seconds in our case). If faster training is needed, the following options can also be considered:
    • Maximum Entropy / tokens feature: 97% accuracy, 4s to train
    • Naive Bayes / all features: 96% accuracy, 0.15s to train (although feature extraction gets rather time-consuming here)
    • Naive Bayes / tokens feature: ~94% accuracy, 0.1s to train
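
The fast Naive Bayes baseline can be sketched as a minimal multinomial classifier over token features (a simplified illustration with add-one smoothing, not the code used in the evaluation; function names are hypothetical):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (token_list, label) pairs. Returns log priors
    and add-one-smoothed log likelihoods per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    model = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / total_docs),
            "log_lik": {w: math.log((word_counts[label][w] + 1) / denom)
                        for w in vocab},
            "unseen": math.log(1 / denom),  # for tokens not in the vocabulary
        }
    return model

def predict_nb(model, tokens):
    """Return the class with the highest posterior log probability."""
    best, best_score = None, float("-inf")
    for label, params in model.items():
        score = params["log_prior"] + sum(
            params["log_lik"].get(t, params["unseen"]) for t in tokens)
        if score > best_score:
            best, best_score = label, score
    return best
```

Training reduces to counting, which is why Naive Bayes trains orders of magnitude faster than the iterative optimization required by Maximum Entropy.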
Last modified on 10/15/09 15:36:25