What is Latent Dirichlet Allocation?
I recently started a project classifying documents with Latent Dirichlet Allocation. It seems to be a way of generating topic models. Wikipedia says,
In statistics, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
What is a generative model (in contrast to a discriminative model)?
A generative model is a full probabilistic model of all variables, whereas a discriminative model provides a model only for the target variable(s) conditional on the observed variables. Thus a generative model can be used, for example, to simulate (i.e. generate) values of any variable in the model, whereas a discriminative model allows only sampling of the target variables conditional on the observed quantities.
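To make that distinction concrete for myself, here is a minimal sketch in Python. The weather/activity variables and all the probabilities are made up for illustration; nothing here comes from Wikipedia or from a real dataset. The joint table plays the role of a generative model, since I can sample complete (x, y) pairs from it. The conditional function exposes only p(y | x), which is all a discriminative model would give me.

```python
import random

# Hypothetical joint distribution p(x, y) over (weather, activity).
# The numbers are invented for illustration and sum to 1.
JOINT = {
    ("sunny", "hike"): 0.4,
    ("sunny", "read"): 0.1,
    ("rainy", "hike"): 0.1,
    ("rainy", "read"): 0.4,
}


def sample_joint(rng):
    """Generative use: draw a complete (x, y) pair from p(x, y)."""
    pairs = list(JOINT)
    weights = [JOINT[pair] for pair in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


def conditional(x):
    """Discriminative view: only p(y | x), here derived from the joint table."""
    total = sum(p for (xi, _), p in JOINT.items() if xi == x)
    return {y: p / total for (xi, y), p in JOINT.items() if xi == x}


rng = random.Random(0)
print(sample_joint(rng))     # a simulated (weather, activity) pair
print(conditional("sunny"))  # distribution over activities given "sunny"
```

Note that the conditional can always be computed from the joint, but not the other way around: knowing p(y | x) alone says nothing about how likely each x is.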
Examples of generative models include:
Gaussian mixture model and other types of mixture model
Probabilistic context-free grammar
Averaged one-dependence estimators
Latent Dirichlet allocation
Restricted Boltzmann machine
Examples of discriminative models used in machine learning include:
Logistic regression (aka maximum entropy classifiers)
Linear discriminant analysis
Boosting (meta-algorithm)
Conditional random fields
Snarxiv uses a context-free grammar built from the original arXiv site. Which is the real article?
A soluble model for scattering and decay in quaternionic quantum mechanics 1: decay
On Extremal Topological Field Theories on Del-Pezzos
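Generating titles from a context-free grammar can be sketched in a few lines of Python. The grammar below is a toy I made up using words from the two titles above; it is not Snarxiv's actual grammar, and the rule names (TITLE, ADJ, NOUN, THEORY) are my own. Each rule is picked uniformly at random, which amounts to a probabilistic grammar with equal weights.

```python
import random

# Toy grammar (hypothetical, not Snarxiv's): each nonterminal maps to a
# list of productions, and each production is a list of symbols.
GRAMMAR = {
    "TITLE": [
        ["ADJ", "NOUN", "in", "THEORY"],
        ["On", "NOUN", "in", "THEORY"],
    ],
    "ADJ": [["Extremal"], ["Soluble"]],
    "NOUN": [["Scattering"], ["Decay"]],
    "THEORY": [
        ["Topological", "Field", "Theories"],
        ["Quaternionic", "Quantum", "Mechanics"],
    ],
}


def expand(symbol, rng):
    """Recursively expand a symbol; anything not in GRAMMAR is a terminal word."""
    if symbol not in GRAMMAR:
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def generate_title(seed=None):
    """Generate one fake title by expanding the TITLE start symbol."""
    rng = random.Random(seed)
    return " ".join(expand("TITLE", rng))


for seed in range(3):
    print(generate_title(seed))
```

A grammar this small produces only a handful of distinct titles, but the same recursion works unchanged on a much larger rule set.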
The words or document features are "observed" and the "topics" are unobserved. I have found a few libraries online which implement these for me:
Princeton CS professor David Blei seems to be the master of topic models. Here's an actual topic analysis of arXiv he gathered, including:
{ group groups ring finite field }
{ surface current transport electronic josephson }
{ energy nuclear model state calculations }
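To see what "observed words, unobserved topics" means, it helps me to run the generative story forward. The sketch below is a simplified toy, not Blei's code or any library's API. The topic labels are names I invented for the three word groups above, each topic picks its words uniformly (real LDA learns a separate word distribution per topic), and the value of alpha is arbitrary.

```python
import random

# The three word groups from the arXiv analysis above. The topic labels
# are hypothetical names added for readability.
TOPICS = {
    "algebra": ["group", "groups", "ring", "finite", "field"],
    "condensed_matter": ["surface", "current", "transport", "electronic", "josephson"],
    "nuclear": ["energy", "nuclear", "model", "state", "calculations"],
}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=0.5, seed=None):
    """Generate one document: topic mixture, hidden topic per word, then words."""
    rng = random.Random(seed)
    names = list(TOPICS)
    # Per-document topic proportions (latent).
    theta = sample_dirichlet([alpha] * len(names), rng)
    hidden_topics, words = [], []
    for _ in range(n_words):
        # Latent topic assignment for this word position.
        topic = rng.choices(names, weights=theta, k=1)[0]
        # Observed word; uniform within the topic is a simplification.
        word = rng.choice(TOPICS[topic])
        hidden_topics.append(topic)
        words.append(word)
    return theta, hidden_topics, words


theta, hidden_topics, words = generate_document(8, seed=42)
print("observed:", " ".join(words))
print("hidden:  ", hidden_topics)
```

Fitting LDA is this process in reverse: given only the observed words across many documents, infer the topic mixtures and the word groups that most plausibly produced them.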
I wonder if I can verify their linguistic relationships with actual ones. I would like to try this for TechCrunch. This topic lies in the middle of semantics and statistical algorithms.
To end our overview: Latent Dirichlet Allocation is like this
While probabilistic latent semantic analysis is like this
Gaussian mixture model looks like this:
Next time: what do these plate diagrams mean? What does LDA actually do? What is a mixture model?