Tutorial About Probabilistic Classification Models

Benjamin Adrian, Gunnar Grimnes, Jörn Hees, Matthias Sperber

Abstract

Introduction

Classification is the problem of deciding which class a given input belongs to. It is usually divided into a learning phase (also called the training phase) and a classification phase (also called the test phase). (TODO: offline, online, reinforcement, ... learning)

Basics

Several basic concepts need to be understood before diving into probabilistic learning models.

Example / Instance

Examples, also called instances, are the basic entities in this field. They occur as training examples, as validation or test examples, and finally as real data.

E.g., in a document classification scenario, examples are documents. 
Documents that have already been classified are used for training or evaluation.

Feature

A feature is a descriptive property of an example. Features can be processed by machines.

E.g., in a document classification scenario, features might be the words of a document.
Consequently, a single feature may describe multiple examples (here, documents).

Feature Extraction

Feature extraction is the task of extracting features from examples.

E.g., in our document classification scenario, a tokenizer that extracts words from 
text might be used for feature extraction.

In more sophisticated scenarios, feature extraction can be nested hierarchically by extracting new features from existing feature lists.

E.g., in our document classification scenario, a word n-gram algorithm extracts n-gram 
features from the extracted word sequences. 
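
The two extraction stages described above can be sketched in a few lines of Python. This is only an illustration: the function names `tokenize` and `word_ngrams` and the very simple regular-expression tokenizer are assumptions made for the example, not part of any particular library.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(tokens, n):
    """Build word n-grams from an already extracted token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# First stage: words as features. Second stage: bigrams built from those words.
tokens = tokenize("Cheap pills, buy cheap pills now!")
bigrams = word_ngrams(tokens, 2)
```

Note that the second stage never looks at the raw text; it only consumes the feature list produced by the first stage, which is what makes the extraction nested.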

Feature Selection

Every feature of every example has to be processed by model trainers or executors, so there are several reasons for selecting only a subset of the available features. First, not all features are useful for separating different classes: for such features there is no statistically significant dependency between class and feature occurrence.

E.g., in our document classification scenario, stop words and other high-frequency words are 
not useful for separating, e.g., spam mails from ham mails.

Second, a small set of features might be enough to classify examples successfully. Adding more only decreases performance.
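
One possible way to test the dependency between class and feature occurrence is a chi-square statistic over a 2x2 contingency table. The sketch below assumes this scoring method (the tutorial does not prescribe one), uses a hypothetical helper `chi_square`, and keeps features scoring at least 3.84, the 5% critical value for one degree of freedom. The toy data set is far too small for the test to be statistically meaningful; it only shows the mechanics.

```python
def chi_square(docs, labels, feature, target):
    """Chi-square statistic of the table (feature present?) x (label == target?)."""
    a = b = c = d = 0
    for features, label in zip(docs, labels):
        present = feature in features
        in_class = label == target
        if present and in_class:
            a += 1          # feature present, target class
        elif present:
            b += 1          # feature present, other class
        elif in_class:
            c += 1          # feature absent, target class
        else:
            d += 1          # feature absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A feature that occurs in every (or no) document carries no information.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Each document is represented by its set of word features.
docs = [
    {"cheap", "pills", "the"},
    {"cheap", "offer", "the"},
    {"meeting", "the"},
    {"lunch", "the", "meeting"},
]
labels = ["spam", "spam", "ham", "ham"]

vocabulary = set().union(*docs)
scores = {f: chi_square(docs, labels, f, "spam") for f in vocabulary}
selected = {f for f, score in scores.items() if score >= 3.84}
```

Here the stop word "the" occurs in every document and scores zero, so it is dropped, while "cheap" and "meeting" separate the two classes and are kept.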

Main Steps

  1. Convert your problem into a classification problem.
  2. Get a pre-classified data set (the more data or even data sets the better). Divide it into training, development and test sets.
  3. Think about your features. This is the most important step!
  4. Process data, extract these features, select significant ones and store them.
  5. Train your model with your training data.
  6. Classify your test data.
  7. Evaluate results.
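
Step 2 above can be sketched as a shuffled split. The helper `split_dataset`, the 80/10/10 proportions and the fixed seed are assumptions chosen for the example; suitable proportions depend on how much data is available.

```python
import random


def split_dataset(examples, train=0.8, dev=0.1, seed=0):
    """Shuffle labelled examples and cut them into training, development and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],          # everything left over is the test set
    )


# Hypothetical pre-classified data: (document, class) pairs.
data = [(f"document {i}", "spam" if i % 2 else "ham") for i in range(20)]
train_set, dev_set, test_set = split_dataset(data)
```

Shuffling before cutting avoids a split in which, for instance, all spam mails end up in the training set because the data happened to be stored sorted by class.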

Relational Classification

Relational data consists of entities, which are described by features, and statistical dependencies between the entities.

Nearest Neighbor (1-NN or NN)

Nearest Neighbor classifiers are the simplest kind of classifier. In the training phase they simply record the class of each sample. Later, in the classification phase, they calculate the distance from the query to every recorded sample and return the class of the sample closest to the query.
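
A minimal sketch of such a classifier follows. It assumes that samples are numeric feature vectors and that distance is Euclidean; neither is fixed by the description above, and the class name `NearestNeighbor` is made up for the example.

```python
import math


class NearestNeighbor:
    """1-NN classifier: training only records samples, classification does the work."""

    def __init__(self):
        self.samples = []  # list of (feature_vector, label) pairs

    def train(self, vector, label):
        # "Training" just stores the sample together with its class.
        self.samples.append((tuple(vector), label))

    def classify(self, query):
        # Compare the query with every recorded sample and keep the closest one.
        _, label = min(self.samples, key=lambda s: math.dist(s[0], query))
        return label


nn = NearestNeighbor()
nn.train((1.0, 1.0), "ham")
nn.train((5.0, 5.0), "spam")
result = nn.classify((1.5, 0.5))
```

Because every query is compared with all recorded samples, classification time grows linearly with the size of the training set.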

$k$NN

The $k$-Nearest Neighbor classifier generalizes the simple NN: instead of immediately returning the class of the single best-matching sample, it inspects the $k$ samples nearest to the query and returns a class determined by a merging function, such as:

  • the most frequently observed class
  • classes weighted by inverse distances
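
Both merging functions listed above can be sketched as interchangeable callables. As before, Euclidean distance over numeric vectors is an assumption, and the function names and the small `eps` that guards against division by zero are illustrative choices.

```python
import math
from collections import Counter, defaultdict


def k_nearest(samples, query, k):
    """Return the k (vector, label) samples closest to the query."""
    return sorted(samples, key=lambda s: math.dist(s[0], query))[:k]


def majority_vote(neighbors, query):
    """Merging function: most often observed class among the neighbors."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def inverse_distance_vote(neighbors, query, eps=1e-9):
    """Merging function: each neighbor votes with weight 1 / distance."""
    weights = defaultdict(float)
    for vector, label in neighbors:
        weights[label] += 1.0 / (math.dist(vector, query) + eps)
    return max(weights, key=weights.get)


def knn_classify(samples, query, k, merge=majority_vote):
    return merge(k_nearest(samples, query, k), query)


samples = [((0, 0), "a"), ((3, 0), "b"), ((3.5, 0), "b"), ((10, 10), "a")]
by_majority = knn_classify(samples, (1, 0), 3)
by_distance = knn_classify(samples, (1, 0), 3, merge=inverse_distance_vote)
```

In this toy data set the two merging functions disagree: two of the three nearest samples belong to class "b", so the majority vote returns "b", but the single "a" sample is so much closer to the query that inverse-distance weighting returns "a".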

Naive Bayes

Naive Bayes classifiers are probabilistic classifiers that apply Bayes' rule under the "naive" assumption that the features $f_1, \dots, f_n$ of an example are conditionally independent given its class $c$:

$\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(f_i \mid c)$

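As a rough illustration of a Naive Bayes classifier in the document classification scenario, the sketch below assumes a multinomial word model with add-one (Laplace) smoothing. Both choices, as well as the class name `NaiveBayes`, are assumptions made for the example and are not specified by this tutorial.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word features with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # number of documents per class
        self.word_counts = defaultdict(Counter)  # word occurrences per class
        self.vocabulary = set()

    def train(self, words, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, words):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train(["cheap", "pills"], "spam")
nb.train(["cheap", "offer"], "spam")
nb.train(["meeting", "lunch"], "ham")
nb.train(["project", "meeting"], "ham")
prediction = nb.classify(["cheap", "meeting", "cheap"])
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow for long documents, and the smoothing keeps a word that was never seen in a class from forcing that class's probability to zero.
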
Maximum Entropy

Multi Layer Perceptrons

Support Vector Machines

Sequential Classification

Hidden Markov Model

Conditional Random Field

A conditional random field is a conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ with an associated graphical structure.

Appendix

Mathematical Foundations

Bayes Rule

$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$

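A small numeric example may make Bayes' rule more concrete. The function `posterior` and all probabilities below are invented for illustration: a prior spam probability of 0.2, and a word that appears in 50% of spam mails and 5% of ham mails.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(A | B) from P(A), P(B | A) and P(B | not A), using Bayes' rule."""
    # P(B) by the law of total probability.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# A = "mail is spam", B = "mail contains the word".
p_spam_given_word = posterior(prior=0.2, likelihood=0.5, likelihood_given_not=0.05)
```

With these made-up numbers the posterior is 0.1 / (0.1 + 0.04), roughly 0.71, so observing the word raises the spam probability from 0.2 to about 0.71.
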
Last modified on 08/17/09 08:49:31