Help me understand the example of using php-ml

I spent a couple of hours trying to figure out the language-detection example for php-ml, and I don't understand what is happening in it. The model is trained from a CSV file whose rows have the format "sentence text","language". Here is the example itself:

$dataset = new CsvDataset('data/languages.csv', 1);
$vectorizer = new TokenCountVectorizer(new WordTokenizer());
$tfIdfTransformer = new TfIdfTransformer();
$samples = [];
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}
$vectorizer->fit($samples);
$vectorizer->transform($samples);
$tfIdfTransformer->fit($samples);
$tfIdfTransformer->transform($samples);
$dataset = new ArrayDataset($samples, $dataset->getTargets());
$randomSplit = new StratifiedRandomSplit($dataset, 0.1);
$classifier = new SVC(Kernel::RBF, 10000);
$classifier->train($randomSplit->getTrainSamples(), $randomSplit->getTrainLabels());
$predictedLabels = $classifier->predict($randomSplit->getTestSamples());
echo 'Accuracy: '.Accuracy::score($randomSplit->getTestLabels(), $predictedLabels);

It is not clear where new input data is supposed to be fed in. All I have figured out is that the only input here is the data the model was trained on.

How do I feed my own text into the model and determine how likely it is that the sentence is written in English?

Here is a link to the example itself php-ml-examples/classification/languageDetection.php

And the library https://github.com/php-ai/php-ml

Author: ilyaplot, 2017-09-20

1 answer

The code is written in a classic style.

Here is the code with comments (I don't know how comments are usually written in PHP, so bear with me). There may be inaccuracies in the comments, since I am not familiar with the library you are using.

# Read the data from the CSV into an object-feature table (judging by the code, it is text data)
$dataset = new CsvDataset('data/languages.csv', 1);
# Create the tokenizer object
$vectorizer = new TokenCountVectorizer(new WordTokenizer());
# Create a Tf-Idf transformer
$tfIdfTransformer = new TfIdfTransformer();
$samples = [];
# Push the data into an array object by object (row by row)
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}
# Fit the tokenizer ("initialization"), then actually split the data into tokens (probably into words)
$vectorizer->fit($samples);
$vectorizer->transform($samples);
# Build the Tf-Idf statistics ("initialization"), then actually apply the transformation
$tfIdfTransformer->fit($samples);
$tfIdfTransformer->transform($samples);
# Convert the data into the usual (X, y) == (features, class) form
$dataset = new ArrayDataset($samples, $dataset->getTargets());
# Split the data into two parts: a training set and a validation set
$randomSplit = new StratifiedRandomSplit($dataset, 0.1);
# Create the model object with an RBF kernel (the second parameter is the SVM cost/regularization parameter C)
$classifier = new SVC(Kernel::RBF, 10000);
# Train the model on the training set
$classifier->train($randomSplit->getTrainSamples(), $randomSplit->getTrainLabels());
# Predict on the test set and look at the result
$predictedLabels = $classifier->predict($randomSplit->getTestSamples());
echo 'Accuracy: '.Accuracy::score($randomSplit->getTestLabels(), $predictedLabels);
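To answer the actual question of how to classify your own text: a new sentence has to go through the same already-fitted preprocessing pipeline before it reaches the classifier. Here is a minimal sketch, assuming the php-ml objects from the code above (in php-ml, transform() modifies the passed array in place); the sample sentence is made up for illustration:

```php
// The vectorizer and Tf-Idf transformer must NOT be re-fitted here;
// they reuse the vocabulary and statistics learned from the training data.
$newSamples = ['This is an example sentence.'];

$vectorizer->transform($newSamples);       // words -> token counts
$tfIdfTransformer->transform($newSamples); // counts -> tf-idf weights

// predict() returns the class label as it appears in the CSV
// (e.g. "english"), not a probability: a plain SVC gives a hard decision.
$predictedLanguage = $classifier->predict($newSamples)[0];
echo $predictedLanguage;
```

Note that this gives you a predicted label, not a likelihood. If you need probability estimates, check your library version's documentation for whether its SVC supports libsvm-style probability estimates; otherwise an inherently probabilistic classifier (e.g. naive Bayes) is a simpler route.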

I also have a nagging feeling that there is a lurking bug in this code, namely that data could end up being passed to vectorizer and tfIdfTransformer together with the class labels. In addition, I have some doubts about the way vectorizer and tfIdfTransformer are used.

It would also be good to look not only at accuracy, but also at precision and recall. You can read about this here. The issue is that the sample may be unbalanced: if class 0 has 100,000 messages and class 1 has only 100, the accuracy estimate will be inflated. For your problem, with two classes, you can also use the F-measure. You can find more information about methods for evaluating algorithms in one of these blogs: one, two. Of course, in your case this aspect is partly handled, since the split into testSet and trainSet is performed in a balanced way (the word Stratified indicates this). But if the data is inherently unbalanced, stratification fundamentally cannot fix that, and you do need to compute both recall and precision.
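To make the unbalanced-data point concrete, here is a small, self-contained PHP sketch (no php-ml involved) computing precision, recall and the F-measure, and showing how accuracy alone can be misleading:

```php
<?php
// Precision, recall and F1 for a binary problem, given parallel arrays
// of true labels and predictions and the label treated as "positive".
function precisionRecallF1(array $trueLabels, array $predicted, $positive): array
{
    $tp = $fp = $fn = 0;
    foreach ($trueLabels as $i => $t) {
        $p = $predicted[$i];
        if ($p === $positive && $t === $positive) $tp++;
        elseif ($p === $positive && $t !== $positive) $fp++;
        elseif ($p !== $positive && $t === $positive) $fn++;
    }
    $precision = ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0;
    $recall    = ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0;
    $f1 = ($precision + $recall) > 0
        ? 2 * $precision * $recall / ($precision + $recall)
        : 0.0;
    return [$precision, $recall, $f1];
}

// Heavily unbalanced toy data: a "classifier" that always answers 0
// gets 90% accuracy, yet recall for class 1 is 0 -- it is useless.
$true      = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1];
$predicted = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];

$correct = 0;
foreach ($true as $i => $t) {
    if ($t === $predicted[$i]) $correct++;
}
echo 'accuracy=' . $correct / count($true) . "\n"; // accuracy=0.9

[$p, $r, $f] = precisionRecallF1($true, $predicted, 1);
echo "precision=$p recall=$r f1=$f\n"; // all three are 0
```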

I also note that cross-validation is not performed, so the accuracy estimate can vary considerably from run to run. Cross-validation means that the split into training and test sets is performed repeatedly. In your case it is done randomly, once. I.e.:

  • 10% of the data is taken as the test set, the remaining 90% as the training set
  • The model is trained on the training set
  • The model is evaluated on the test set
  • The data is split in another way: 10% into the test set, 90% into the training set
  • ...

This is done many times, and all the results obtained are averaged; the averaging compensates for errors in the individual estimates.
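The procedure above can be sketched with the php-ml classes from the question. This is a rough sketch, not the library's built-in cross-validation: it assumes StratifiedRandomSplit accepts a seed as its third argument, which is worth verifying against the library version you use:

```php
// Repeated stratified random splitting as a simple stand-in for
// k-fold cross-validation; $dataset is the ArrayDataset built above.
$scores = [];
for ($seed = 1; $seed <= 10; $seed++) {
    // A different seed should produce a different random split.
    $split = new StratifiedRandomSplit($dataset, 0.1, $seed);

    $classifier = new SVC(Kernel::RBF, 10000);
    $classifier->train($split->getTrainSamples(), $split->getTrainLabels());

    $predicted = $classifier->predict($split->getTestSamples());
    $scores[]  = Accuracy::score($split->getTestLabels(), $predicted);
}

// The averaged score is a more stable estimate than any single split.
echo 'Mean accuracy: ' . array_sum($scores) / count($scores);
```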

Author: hedgehogues, 2017-09-27 09:57:35