pine package

Submodules

pine.configuration module

pine.configuration.MODEL_BASENAMES(subwords, positions)
pine.configuration.MODEL_FRIENDLY_NAMES(subwords, positions)

pine.language_model module

class pine.language_model.LanguageModel(corpus: str, workspace: Union[pathlib.Path, str] = '.', language: str = 'en', subwords: bool = True, positions: Union[bool, str] = 'constrained', use_vocab_from: Optional[pine.language_model.LanguageModel] = None, friendly_name: Optional[str] = None, extra_fasttext_parameters: Optional[Dict] = None)

Bases: object

A log-bilinear language model.

Parameters
  • corpus (str) – The name of the corpus on which the language model will be trained. See get_corpus() for a list of available corpora for a given language.

  • workspace ({Path, str}, optional) – The path to a workspace in which corpora, models, evaluation results, and cached data will be stored. The workspace ensures that when you later reinitialize a model with the same parameters, it is not necessary to retrain it and all available data will be loaded from the workspace. This is convenient for experiments but less so for production use, where you should use the low-level gensim.models.fasttext.FastText class instead of LanguageModel. The default workspace is the current working directory (.).

  • language (str, optional) – The language of the model. Determines the corpora and the evaluation tasks available for the model. The default language is English (en).

  • subwords (bool, optional) – Whether the model will use subwords as well as words during the training. Using subwords is known to improve the speed of convergence for log-bilinear models, especially for inflected natural languages. [bojanowski2017enriching] However, subwords may not exist in domains outside natural language processing, where corpora can be sequences of arbitrary events. By default, the model will use subwords (True).

  • positions ({bool, str}, optional) – Whether the model will use positions of context words relative to the masked input word during the training. Using positions is known to reduce the perplexity of log-bilinear models. [mikolov2018advances] The options are to not use positions (False), to use full dimensionality for positions ('full'), or to use reduced dimensionality for positions ('constrained'). Using reduced dimensionality for positions has been shown to improve the speed of training and also the speed of convergence for log-bilinear models. [novotny2021when] By default, the model will use reduced dimensionality for positions ('constrained').

  • use_vocab_from ({LanguageModel, NoneType}, optional) – Another trained log-bilinear language model to borrow corpus statistics (vocab) from to speed up the training. The other model must have been trained on the same corpus in the same language as this model. By default, we will do a preliminary pass over the corpus to retrieve the vocab (None).

  • friendly_name ({str, NoneType}, optional) – Your own name for the log-bilinear language model, which will be used in outputs and in visualizations. The name will not affect the workspace, i.e. initializing a model with a different friendly name does not mean that you need to retrain the model. By default, the friendly name of the model will be automatically generated according to its parameters (None).

  • extra_fasttext_parameters ({dict, NoneType}, optional) – Extra parameters for the log-bilinear language model to override the defaults from FASTTEXT_PARAMETERS. These parameters are passed to the gensim.models.fasttext.FastText class constructor. By default, no extra parameters will be used (None).
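The workspace behavior described for the workspace and friendly_name parameters (same parameters, no retraining; a different friendly name changes nothing) can be pictured as a parameter-keyed cache. The following is a minimal sketch of that idea only and not the actual implementation in pine: the helper names cache_key and load_or_train, the JSON file layout, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(corpus, language, subwords, positions):
    """Derive a stable key from the training parameters only (hypothetical scheme)."""
    parameters = {
        'corpus': corpus,
        'language': language,
        'subwords': subwords,
        'positions': positions,
    }
    serialized = json.dumps(parameters, sort_keys=True)
    return hashlib.sha256(serialized.encode('utf-8')).hexdigest()[:16]


def load_or_train(workspace, train, friendly_name=None, **parameters):
    """Return (model, was_cached); `train` is called only on a cache miss.

    The friendly name is accepted but deliberately left out of the cache key.
    """
    path = Path(workspace) / 'model' / (cache_key(**parameters) + '.json')
    if path.exists():
        return json.loads(path.read_text(encoding='utf-8')), True
    model = train(**parameters)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(model), encoding='utf-8')
    return model, False


def fake_train(**parameters):
    # Stand-in for an expensive training run.
    return {'trained_with': parameters}


with tempfile.TemporaryDirectory() as workspace:
    parameters = dict(corpus='toy', language='en', subwords=True, positions='constrained')
    _, first_cached = load_or_train(workspace, fake_train, **parameters)
    # A different friendly name maps to the same cache entry.
    _, second_cached = load_or_train(workspace, fake_train, friendly_name='my model', **parameters)
```

In this sketch the first call trains and stores the stand-in model, and the second call finds it in the workspace even though the friendly name differs.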

Variables
  • corpus (Corpus) – The corpus on which the language model will be trained.

  • workspace (Path) – The path to a workspace in which corpora, models, evaluation results, and cached data will be stored. The workspace ensures that when you later reinitialize a model with the same parameters, it is not necessary to retrain it and all available data will be loaded from the workspace. This is convenient for experiments but less so for production use, where you should use the low-level gensim.models.fasttext.FastText class instead of LanguageModel.

  • language (str) – The language of the model. Determines the corpora and the evaluation tasks available for the model.

  • subwords (bool) – Whether the model will use subwords as well as words during the training. Using subwords is known to improve the speed of convergence for log-bilinear models, especially for inflected natural languages. [bojanowski2017enriching] However, subwords may not exist in domains outside natural language processing, where corpora can be sequences of arbitrary events.

  • positions ({bool, str}) – Whether the model will use positions of context words relative to the masked input word during the training. Using positions is known to reduce the perplexity of log-bilinear models. [mikolov2018advances] The options are to not use positions (False), to use full dimensionality for positions ('full'), or to use reduced dimensionality for positions ('constrained'). Using reduced dimensionality for positions has been shown to improve the speed of training and also the speed of convergence for log-bilinear models. [novotny2021when]

  • friendly_name (str) – Your own name for the log-bilinear language model, which will be used in outputs and in visualizations. The name will not affect the workspace, i.e. initializing a model with a different friendly name does not mean that you need to retrain the model.

  • fasttext_parameters (dict) – The full parameters of the log-bilinear language model for the gensim.models.fasttext.FastText class constructor.

  • model (gensim.models.fasttext.FastText) – The raw log-bilinear language model.

  • vectors (gensim.models.keyedvectors.KeyedVectors) – The word, subword, and positional embeddings of the log-bilinear language model.

  • training_duration (float) – The training duration of the model in seconds.

  • input_vectors (np.ndarray) – The input word vectors of the log-bilinear language model.

  • output_vectors (np.ndarray) – The output word vectors of the log-bilinear language model.

  • positional_vectors (np.ndarray) – The input positional vectors of the log-bilinear language model.

  • position_importance (PositionImportance) – The importance of positions. [novotny2021when]

  • positional_feature_clusters (ClusteredPositionalFeatures) – Clusters of positional features. [novotny2021when]

  • words (sequence of str) – All words in the dictionary of the log-bilinear model.

  • classified_context_words (dict of (str, str)) – All words in the dictionary of the log-bilinear model, classified to the individual clusters of positional features. [novotny2021when]

  • word_analogy (WordAnalogyResult) – The results of the log-bilinear language model on the word analogy task of Mikolov et al. (2013) [mikolov2013efficient].

  • language_modeling (LanguageModelingResult) – The results of the log-bilinear language model on the language modeling task of Novotný et al. (2021) [novotny2021when].

  • model_files (iterable of (Path, int)) – The individual files of the log-bilinear language model with their sizes in bytes.

  • cache_files (iterable of (Path, int)) – The individual cached data of the log-bilinear language model with their sizes in bytes.
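To make the notion of subwords used by the subwords and vectors variables concrete, here is a minimal sketch of character n-gram extraction in the style described by Bojanowski et al. (2017): the word is wrapped in the boundary symbols < and > and all n-grams with lengths from 3 to 6 are collected. The function name character_ngrams is hypothetical, and this is not the code used by pine or gensim, which additionally hash the n-grams into a fixed number of buckets.

```python
def character_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary symbols."""
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# The trigrams of "where", the example word used by Bojanowski et al. (2017).
trigrams = character_ngrams('where', min_n=3, max_n=3)
print(trigrams)
```

Because inflected forms of a word share most of their n-grams, their embeddings share parameters, which is one intuition behind the faster convergence mentioned above.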

References

The general log-bilinear language model was developed by Mikolov et al. (2013) [mikolov2013efficient]. The subword model was developed by Bojanowski et al. (2017) [bojanowski2017enriching]. The positional model was developed by Mikolov et al. (2018) [mikolov2018advances]. The constrained positional model was developed by Novotný et al. (2021) [novotny2021when].
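As an illustration of the positional models named above, the sketch below reweights each context word vector feature by feature with a vector for its position before averaging, which is the general idea of positional weighting in Mikolov et al. (2018). For the constrained variant, the sketch assumes that "reduced dimensionality" means the positional vectors cover only the first few features and the remaining features are averaged unweighted. That reading, the function name positional_context_vector, and the toy numbers are assumptions, not a description of pine's internals.

```python
def positional_context_vector(word_vectors, positional_vectors):
    """Average context word vectors, reweighting features by position.

    word_vectors[i] is the input vector of the i-th context word and
    positional_vectors[i] is the vector for its position. A positional vector
    may be shorter than a word vector; features beyond its length are left
    unweighted (the assumed "constrained" variant).
    """
    dimensionality = len(word_vectors[0])
    context = [0.0] * dimensionality
    for word_vector, positional_vector in zip(word_vectors, positional_vectors):
        for feature, value in enumerate(word_vector):
            if feature < len(positional_vector):
                value *= positional_vector[feature]
            context[feature] += value
    return [value / len(word_vectors) for value in context]


word_vectors = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]

# Full dimensionality: every feature is weighted by position.
full = positional_context_vector(
    word_vectors, [[2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0]])

# Reduced dimensionality: only the first two features are weighted.
constrained = positional_context_vector(
    word_vectors, [[2.0, 2.0], [0.0, 0.0]])
```

With shorter positional vectors there are fewer multiplications per context word, which gives one intuition for the training speed-up reported for the constrained model.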

mikolov2013efficient

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://arxiv.org/abs/1301.3781v3

bojanowski2017enriching

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146. https://arxiv.org/pdf/1607.04606.pdf

mikolov2018advances

Mikolov, T., et al. (2018). Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf

novotny2021when

Novotný, V., et al. (2021). When FastText pays attention: Efficient estimation of word representations using constrained positional weighting. Manuscript submitted for publication.

property basename: str
property cache_dir: pathlib.Path
property cache_files: Iterable[Tuple[pathlib.Path, int]]
property classified_context_words: Dict[str, str]
classify_context_word(word: str) → str

Classify a context word to a cluster of positional features. [novotny2021when]

Parameters

word (str) – A context word.

Returns

cluster_label – A label of a cluster of positional features to which the context word has been classified.

Return type

str

property corpus: Iterable[Iterable[str]]
property corpus_dir: pathlib.Path
property dataset_dir: pathlib.Path
property fasttext_parameters: Dict
property friendly_name: str
get_masked_word_probability(sentence: Sentence, masked_word: str, cluster_label: Optional[str] = None) → SentenceProbability

Get the probability of a sentence given a masked word.

Parameters
  • sentence (Sentence) – A sentence.

  • masked_word (str) – A masked word.

  • cluster_label ({str, NoneType}, optional) – The cluster of positional features [novotny2021when] to use for the prediction. By default, we will use all features (None).

Returns

sentence_probability – The probability of the sentence given the masked word.

Return type

SentenceProbability
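For intuition about the kind of quantity involved, a log-bilinear model scores a candidate word by the dot product of a context vector with the word's output vector and normalizes the scores with a softmax. The sketch below shows only this generic computation on toy vectors. The function name masked_word_probability, the dictionary layout, and the use of a full softmax are assumptions; how LanguageModel builds the context vector and what SentenceProbability contains are not specified here.

```python
import math


def masked_word_probability(context_vector, output_vectors, masked_word):
    """Softmax probability of `masked_word` given a context vector (toy sketch)."""
    scores = {
        word: sum(c * o for c, o in zip(context_vector, vector))
        for word, vector in output_vectors.items()
    }
    # Subtract the maximum score for numerical stability.
    maximum = max(scores.values())
    exponentials = {word: math.exp(score - maximum) for word, score in scores.items()}
    return exponentials[masked_word] / sum(exponentials.values())


output_vectors = {'cat': [1.0, 0.0], 'dog': [0.0, 1.0]}
context_vector = [1.0, 0.0]
p_cat = masked_word_probability(context_vector, output_vectors, 'cat')
p_dog = masked_word_probability(context_vector, output_vectors, 'dog')
```

The probabilities over the whole toy vocabulary sum to one, and the word whose output vector is better aligned with the context receives the larger share.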

property input_vectors: numpy.ndarray
property language_modeling: LanguageModelingResult
property model: gensim.models.fasttext.FastText
property model_dir: pathlib.Path
property model_files: Iterable[Tuple[pathlib.Path, int]]
property output_vectors: numpy.ndarray
property position_importance: PositionImportance
property positional_feature_clusters: ClusteredPositionalFeatures
property positional_vectors: numpy.ndarray
predict_masked_words(sentence: Sentence) → Iterable[str]

Predict masked words for a sentence.

Parameters

sentence (Sentence) – A sentence.

Returns

masked_words – The predicted masked words for the sentence in descending order of probability.

Return type

iterable of str
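Because the result is an iterable ordered by decreasing probability, a caller who only needs the best few candidates can stop early. The following sketch mimics that contract with a hypothetical generator, rank_masked_words, that orders a toy vocabulary by dot-product score. It is a stand-in for illustration and does not call pine.

```python
from itertools import islice


def rank_masked_words(context_vector, output_vectors):
    """Yield candidate words in descending order of dot-product score."""
    def score(word):
        return sum(c * o for c, o in zip(context_vector, output_vectors[word]))

    yield from sorted(output_vectors, key=score, reverse=True)


output_vectors = {'cat': [0.9, 0.1], 'dog': [0.5, 0.5], 'car': [-0.2, 0.8]}

# Take only the two most probable candidates from the iterable.
top_two = list(islice(rank_masked_words([1.0, 0.0], output_vectors), 2))
```

Sorting by raw score gives the same order as sorting by softmax probability, since the softmax is monotonic in the score.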

print_files()

Pretty-print the individual files and cached data of the log-bilinear language model on the standard output.
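A listing of this kind can be built from (Path, int) pairs such as those documented for model_files and cache_files. The sketch below is a generic stand-in rather than the code behind print_files(): the helper names list_files and format_size, the binary size units, and the tab-separated layout are assumptions.

```python
import tempfile
from pathlib import Path


def list_files(directory):
    """Yield (path, size in bytes) for every regular file under a directory."""
    for path in sorted(Path(directory).rglob('*')):
        if path.is_file():
            yield path, path.stat().st_size


def format_size(size):
    """Format a size in bytes using binary units."""
    for unit in ('B', 'KiB', 'MiB', 'GiB'):
        if size < 1024 or unit == 'GiB':
            return f'{size:.1f} {unit}'
        size /= 1024


with tempfile.TemporaryDirectory() as directory:
    # Create a 2 KiB placeholder file to list.
    (Path(directory) / 'model.bin').write_bytes(b'\0' * 2048)
    files = [(path.name, size) for path, size in list_files(directory)]
    for name, size in files:
        print(f'{name}\t{format_size(size)}')
```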

produce_example_sentences(cluster_label: str) → ExampleSentences

Produce two example sentences that characterize a cluster of positional features. [novotny2021when]

A context word from a cluster of positional features will be placed at two different positions in a sentence, chosen so that it produces the greatest difference in masked word predictions. This illustrates the behavior and the purpose of a cluster of positional features.

Parameters

cluster_label (str) – A label of a cluster of positional features.

Returns

example_sentences – Two example sentences that characterize the cluster of positional features.

Return type

ExampleSentences

property training_duration: float
property vectors: gensim.models.keyedvectors.KeyedVectors
property word_analogy: WordAnalogyResult
property words: Sequence[str]
class pine.language_model.TrainingDurationMeasure

Bases: gensim.models.callbacks.CallbackAny2Vec

on_epoch_begin(model)

Method called at the start of each epoch.

Parameters

model (Word2Vec or subclass) – Current model.

on_epoch_end(model)

Method called at the end of each epoch.

Parameters

model (Word2Vec or subclass) – Current model.
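The two epoch hooks are enough to measure training time. The sketch below shows the pattern with a plain class so that it runs without gensim installed. The class name EpochTimer and its attributes are hypothetical, and this is not the source of TrainingDurationMeasure. With gensim, such a class would subclass gensim.models.callbacks.CallbackAny2Vec and be passed to the model through its callbacks argument.

```python
import time


class EpochTimer:
    """Accumulate wall-clock time spent between epoch begin and epoch end."""

    def __init__(self):
        self.total_seconds = 0.0
        self.epochs = 0
        self._epoch_start = None

    def on_epoch_begin(self, model):
        # Remember when the current epoch started.
        self._epoch_start = time.perf_counter()

    def on_epoch_end(self, model):
        # Add the duration of the epoch that just finished.
        self.total_seconds += time.perf_counter() - self._epoch_start
        self.epochs += 1


# Simulate three short epochs; `model` is unused in this sketch.
timer = EpochTimer()
for _ in range(3):
    timer.on_epoch_begin(model=None)
    time.sleep(0.01)
    timer.on_epoch_end(model=None)
```

Timing only the span between the two hooks leaves out work done outside the epochs, such as building the vocabulary.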

pine.util module

pine.util.download_to(url: str, path: pathlib.Path, size: Optional[int] = None, transformation: Optional[Callable[[str], str]] = None, extract_file: Optional[pathlib.Path] = None, buffer_size: int = 1048576)
pine.util.interpolate(X: numpy.ndarray, Y: numpy.ndarray, kind: Optional[str] = None) → Tuple[numpy.ndarray, numpy.ndarray]
pine.util.parallel_simple_preprocess(pool, path: pathlib.Path, semaphore) → Iterable[List[str]]
pine.util.produce(iterable: Iterable[pine.util.T], semaphore) → Iterable[pine.util.T]
pine.util.simple_preprocess(document: str) → List[str]
pine.util.stringify_parameters(parameters: Dict) → str
pine.util.unzip_to(archive: pathlib.Path, result_dir: pathlib.Path, unlink_after: bool = False)

Module contents

class pine.LanguageModel(corpus: str, workspace: Union[pathlib.Path, str] = '.', language: str = 'en', subwords: bool = True, positions: Union[bool, str] = 'constrained', use_vocab_from: Optional[pine.language_model.LanguageModel] = None, friendly_name: Optional[str] = None, extra_fasttext_parameters: Optional[Dict] = None)

Documented above as pine.language_model.LanguageModel; the parameters, variables, references, properties, and methods are identical.