pine package¶

Subpackages¶

Submodules¶

pine.configuration module¶

pine.configuration.MODEL_BASENAMES(subwords, positions)¶

pine.configuration.MODEL_FRIENDLY_NAMES(subwords, positions)¶

pine.language_model module¶

class pine.language_model.LanguageModel(corpus: str, workspace: Union[pathlib.Path, str] = '.', language: str = 'en', subwords: bool = True, positions: Union[bool, str] = 'constrained', use_vocab_from: Optional[pine.language_model.LanguageModel] = None, friendly_name: Optional[str] = None, extra_fasttext_parameters: Optional[Dict] = None)¶

Bases: object

A log-bilinear language model.

Parameters

corpus (str) – The name of the corpus on which the language model will be trained. See get_corpus() for a list of available corpora for a given language.
workspace ({Path, str}, optional) – The path to a workspace in which corpora, models, evaluation results, and cached data will be stored. The workspace ensures that when you later reinitialize a model with the same parameters, it is not necessary to retrain it and all available data will be loaded from the workspace. This is convenient for experiments, although not so convenient for productization. For productization, you should use the low-level gensim.models.fasttext.FastText class instead of LanguageModel. The default workspace is the current working directory (.).
language (str, optional) – The language of the model. Determines the corpora and the evaluation tasks available for the model. The default language is English (en).
subwords (bool, optional) – Whether the model will use subwords as well as words during the training. Using subwords is known to improve the speed of convergence for log-bilinear models, especially for inflected natural languages. [bojanowski2017enriching] However, subwords may not exist in domains outside natural language processing, where corpora can be sequences of arbitrary events. By default, the model will use subwords (True).
positions ({bool,str}, optional) – Whether the model will use positions of context words relative to the masked input word during the training. Using positions is known to reduce the perplexity of log-bilinear models. [mikolov2018advances] The options are to not use positions (False), to use full dimensionality for positions (full), or to use reduced dimensionality for positions (constrained). Using reduced dimensionality for positions has been shown to improve the speed of training and also the speed of convergence for log-bilinear models. [novotny2021when] By default, the model will use reduced dimensionality for positions (constrained).
use_vocab_from ({LanguageModel, NoneType}, optional) – Another trained log-bilinear language model from which corpus statistics (vocab) will be borrowed to speed up the training. The other model must have been trained on the same corpus in the same language as this model. By default, we will do a preliminary pass over the corpus to retrieve the vocab (None).
friendly_name ({str, NoneType}, optional) – Your own name for the log-bilinear language model, which will be used in outputs and in visualizations. The name will not affect the workspace, i.e. initializing a model with a different friendly name does not mean that you need to retrain the model. By default, the friendly name of the model will be automatically generated according to its parameters (None).
extra_fasttext_parameters ({dict, NoneType}, optional) – Extra parameters for the log-bilinear language model to override the defaults from FASTTEXT_PARAMETERS. These parameters are passed to the gensim.models.fasttext.FastText class constructor. By default, no extra parameters will be used (None).
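To illustrate what the subwords parameter refers to, the sketch below extracts character n-grams in the way described by Bojanowski et al. (2017) [bojanowski2017enriching]: the word is wrapped in the boundary markers < and >, and every n-gram within a range of lengths is taken. The function name character_ngrams and its default lengths are illustrative and not part of pine; the lengths actually used depend on the FastText parameters of the model.

```python
from typing import List


def character_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f'<{word}>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams of "where", the example word used by Bojanowski et al. (2017).
print(character_ngrams('where', min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because inflected forms of a word share most of their n-grams, they also share most of their subword vectors, which is one intuition for why subwords help with inflected languages.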

Variables

corpus (Corpus) – The corpus on which the language model will be trained.
workspace (Path) – The path to a workspace in which corpora, models, evaluation results, and cached data will be stored. The workspace ensures that when you later reinitialize a model with the same parameters, it is not necessary to retrain it and all available data will be loaded from the workspace. This is convenient for experiments, although not so convenient for productization. For productization, you should use the low-level gensim.models.fasttext.FastText class instead of LanguageModel.
language (str) – The language of the model. Determines the corpora and the evaluation tasks available for the model.
subwords (bool) – Whether the model will use subwords as well as words during the training. Using subwords is known to improve the speed of convergence for log-bilinear models, especially for inflected natural languages. [bojanowski2017enriching] However, subwords may not exist in domains outside natural language processing, where corpora can be sequences of arbitrary events.
positions ({bool,str}) – Whether the model will use positions of context words relative to the masked input word during the training. Using positions is known to reduce the perplexity of log-bilinear models. [mikolov2018advances] The options are to not use positions (False), to use full dimensionality for positions (full), or to use reduced dimensionality for positions (constrained). Using reduced dimensionality for positions has been shown to improve the speed of training and also the speed of convergence for log-bilinear models. [novotny2021when]
friendly_name (str) – Your own name for the log-bilinear language model, which will be used in outputs and in visualizations. The name will not affect the workspace, i.e. initializing a model with a different friendly name does not mean that you need to retrain the model.
fasttext_parameters (dict) – The full parameters of the log-bilinear language model for the gensim.models.fasttext.FastText class constructor.
model (gensim.models.fasttext.FastText) – The raw log-bilinear language model.
vectors (gensim.models.keyedvectors.KeyedVectors) – The word, subword, and positional embeddings of the log-bilinear language model.
training_duration (float) – The training duration of the model in seconds.
input_vectors (np.ndarray) – The input word vectors of the log-bilinear language model.
output_vectors (np.ndarray) – The output word vectors of the log-bilinear language model.
positional_vectors (np.ndarray) – The input positional vectors of the log-bilinear language model.
position_importance (PositionImportance) – The importance of positions. [novotny2021when]
positional_feature_clusters (ClusteredPositionalFeatures) – Clusters of positional features. [novotny2021when]
words (sequence of str) – All words in the dictionary of the log-bilinear model.
classified_context_words (dict of (str, str)) – All words in the dictionary of the log-bilinear model, classified to the individual clusters of positional features. [novotny2021when]
word_analogy (WordAnalogyResult) – The results of the log-bilinear language model on the word analogy task of Mikolov et al. (2013) [mikolov2013efficient].
language_modeling (LanguageModelingResult) – The results of the log-bilinear language model on the language modeling task of Novotný et al. (2021) [novotny2021when].
model_files (iterable of (Path, int)) – The individual files of the log-bilinear language model with their sizes in bytes.
cache_files (iterable of (Path, int)) – The individual cached data of the log-bilinear language model with their sizes in bytes.
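The workspace behavior described above can be pictured as a cache keyed by the model parameters. The sketch below is an assumption about the general idea, not pine's implementation: the helper names (basename_for, load_or_train), the file layout, the corpus name, and the JSON stand-in for a trained model are all hypothetical. It also shows the friendly name being left out of the cache key, so that changing it does not trigger retraining.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional


def basename_for(parameters: Dict) -> str:
    """Build a deterministic file name from the parameters (hypothetical scheme)."""
    return '-'.join(f'{key}={parameters[key]}' for key in sorted(parameters))


def load_or_train(workspace: Path, parameters: Dict, train: Callable[[Dict], Dict],
                  friendly_name: Optional[str] = None) -> Dict:
    """Load a cached model from the workspace, training it only on a cache miss."""
    model_path = workspace / 'model' / f'{basename_for(parameters)}.json'
    if model_path.exists():
        model = json.loads(model_path.read_text())
    else:
        model = train(parameters)
        model_path.parent.mkdir(parents=True, exist_ok=True)
        model_path.write_text(json.dumps(model))
    # The friendly name only labels the result; it is not part of the cache key.
    model['friendly_name'] = friendly_name or basename_for(parameters)
    return model


training_runs = []


def fake_train(parameters: Dict) -> Dict:
    """Stand-in for an expensive training run; records that it was called."""
    training_runs.append(parameters)
    return {'parameters': parameters}


with tempfile.TemporaryDirectory() as directory:
    workspace = Path(directory)
    # 'example' is a made-up corpus name used only for this illustration.
    parameters = {'corpus': 'example', 'language': 'en',
                  'subwords': True, 'positions': 'constrained'}
    first = load_or_train(workspace, parameters, fake_train)
    second = load_or_train(workspace, parameters, fake_train, friendly_name='my model')
    print(len(training_runs))  # 1: the second call loaded the cached model
```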

References

The general log-bilinear language model was developed by Mikolov et al. (2013) [mikolov2013efficient]. The subword model was developed by Bojanowski et al. (2017) [bojanowski2017enriching]. The positional model was developed by Mikolov et al. (2018) [mikolov2018advances]. The constrained positional model was developed by Novotný et al. (2021) [novotny2021when].

mikolov2013efficient: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://arxiv.org/abs/1301.3781v3
bojanowski2017enriching: Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146. https://arxiv.org/pdf/1607.04606.pdf
mikolov2018advances: Mikolov, T., et al. (2018). Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf
novotny2021when: Novotný, V., et al. (2021). When FastText pays attention: Efficient estimation of word representations using constrained positional weighting. Manuscript submitted for publication.

property basename: str¶

property cache_dir: pathlib.Path¶

property cache_files: Iterable[Tuple[pathlib.Path, int]]¶

property classified_context_words: Dict[str, str]¶

classify_context_word(word: str) → str¶

Classify a context word to a cluster of positional features. [novotny2021when]

Parameters: word (str) – A context word.
Returns: cluster_label – A label of a cluster of positional features to which the context word has been classified.
Return type: str

property corpus: Iterable[Iterable[str]]¶

property corpus_dir: pathlib.Path¶

property dataset_dir: pathlib.Path¶

property fasttext_parameters: Dict¶

property friendly_name: str¶

get_masked_word_probability(sentence: Sentence, masked_word: str, cluster_label: Optional[str] = None) → SentenceProbability¶

Get the probability of a sentence given a masked word.

Parameters

sentence (Sentence) – A sentence.
masked_word (str) – A masked word.
cluster_label ({str, NoneType}, optional) – The cluster of positional features [novotny2021when] to use for the prediction. By default, we will use all features (None).

Returns

sentence_probability – The probability of the sentence given the masked word.

Return type

SentenceProbability
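As background for this method, the sketch below follows the positional continuous bag-of-words model of Mikolov et al. (2018) [mikolov2018advances]: each context word vector is multiplied elementwise by the vector of its position, the results are averaged, and the average is scored against the output vector of each candidate word. It is a minimal illustration under stated assumptions, not pine's implementation: it uses a full softmax over a toy vocabulary, hand-picked two-dimensional vectors, and hypothetical names (context_vector, masked_word_probability). Constrained positional vectors and the cluster_label parameter are not modeled.

```python
import math
from typing import Dict, List

Vector = List[float]


def context_vector(context: Dict[int, str], input_vectors: Dict[str, Vector],
                   positional_vectors: Dict[int, Vector]) -> Vector:
    """Average the context word vectors, each weighted elementwise by its position."""
    size = len(next(iter(input_vectors.values())))
    total = [0.0] * size
    for position, word in context.items():
        weighted = [u * d for u, d in zip(input_vectors[word], positional_vectors[position])]
        total = [t + w for t, w in zip(total, weighted)]
    return [t / len(context) for t in total]


def masked_word_probability(masked_word: str, context: Dict[int, str],
                            input_vectors: Dict[str, Vector],
                            output_vectors: Dict[str, Vector],
                            positional_vectors: Dict[int, Vector]) -> float:
    """Softmax probability of the masked word given its positioned context words."""
    hidden = context_vector(context, input_vectors, positional_vectors)
    scores = {word: sum(h * o for h, o in zip(hidden, vector))
              for word, vector in output_vectors.items()}
    normalizer = sum(math.exp(score) for score in scores.values())
    return math.exp(scores[masked_word]) / normalizer


# Toy vectors chosen by hand for illustration only.
input_vectors = {'the': [1.0, 0.0], 'barks': [0.0, 1.0]}
output_vectors = {'dog': [1.0, 1.0], 'cat': [1.0, -1.0]}
positional_vectors = {-1: [1.0, 1.0], 1: [1.0, 2.0]}
context = {-1: 'the', 1: 'barks'}  # "the [MASK] barks", keyed by relative position

probabilities = {word: masked_word_probability(word, context, input_vectors,
                                                output_vectors, positional_vectors)
                 for word in output_vectors}
print(probabilities)
```

Sorting the candidates by these probabilities in descending order gives a ranking of the kind returned by predict_masked_words().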

property input_vectors: numpy.ndarray¶

property language_modeling: LanguageModelingResult¶

property model: gensim.models.fasttext.FastText¶

property model_dir: pathlib.Path¶

property model_files: Iterable[Tuple[pathlib.Path, int]]¶

property output_vectors: numpy.ndarray¶

property position_importance: PositionImportance¶

property positional_feature_clusters: ClusteredPositionalFeatures¶

property positional_vectors: numpy.ndarray¶

predict_masked_words(sentence: Sentence) → Iterable[str]¶

Predict masked words for a sentence.

Parameters: sentence (Sentence) – A sentence.
Returns: masked_words – The predicted masked words for the sentence in descending order of probability.
Return type: iterable of str

print_files()¶

Pretty-print the individual files and cached data of the log-bilinear language model to the standard output.
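A listing of files with their sizes in bytes, in the (Path, int) shape used by model_files and cache_files, could be produced along the following lines. This is a standard-library sketch with hypothetical helper names (files_with_sizes, format_files) and made-up file names; the actual output format of print_files() is not specified here.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def files_with_sizes(directory: Path) -> List[Tuple[Path, int]]:
    """Collect every regular file under a directory with its size in bytes."""
    return sorted((path, path.stat().st_size)
                  for path in directory.rglob('*') if path.is_file())


def format_files(files: Iterable[Tuple[Path, int]]) -> str:
    """Render one line per file: the size in bytes, then the path."""
    return '\n'.join(f'{size:>12,d} B  {path}' for path, size in files)


with tempfile.TemporaryDirectory() as directory:
    model_dir = Path(directory)
    (model_dir / 'model.bin').write_bytes(b'\x00' * 2048)   # stand-in for a model file
    (model_dir / 'vectors.npy').write_bytes(b'\x00' * 512)  # stand-in for cached data
    listing = files_with_sizes(model_dir)
    print(format_files(listing))
```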

produce_example_sentences(cluster_label: str) → ExampleSentences¶

Produce two example sentences that characterize a cluster of positional features. [novotny2021when]

A context word from a cluster of positional features will be placed on two different positions of a sentence, where it produces the greatest difference in masked word predictions. This is a useful illustration of the behavior and the purpose of a cluster of positional features.

Parameters: cluster_label (str) – A label of a cluster of positional features.
Returns: example_sentences – Two example sentences that characterize the cluster of positional features.
Return type: ExampleSentences

property training_duration: float¶

property vectors: gensim.models.keyedvectors.KeyedVectors¶

property word_analogy: WordAnalogyResult¶

property words: Sequence[str]¶

class pine.language_model.TrainingDurationMeasure¶

Bases: gensim.models.callbacks.CallbackAny2Vec

on_epoch_begin(model)¶

Method called at the start of each epoch.

Parameters: model (Word2Vec or subclass) – Current model.

on_epoch_end(model)¶

Method called at the end of each epoch.

Parameters: model (Word2Vec or subclass) – Current model.
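A callback of this shape can measure training time by recording a timestamp when each epoch begins and adding the elapsed time when it ends. The sketch below mirrors the two hook names but is only an assumption about the approach: it does not subclass gensim.models.callbacks.CallbackAny2Vec, so that it runs without gensim, and the class name EpochTimer and its attributes are hypothetical.

```python
import time
from typing import Optional


class EpochTimer:
    """Accumulate the wall-clock time spent between epoch begin and epoch end."""

    def __init__(self) -> None:
        self.total_seconds = 0.0
        self.epochs = 0
        self._epoch_start: Optional[float] = None

    def on_epoch_begin(self, model) -> None:
        # Remember when the current epoch started.
        self._epoch_start = time.perf_counter()

    def on_epoch_end(self, model) -> None:
        if self._epoch_start is None:
            raise RuntimeError('on_epoch_end called before on_epoch_begin')
        # Add the duration of the finished epoch to the running total.
        self.total_seconds += time.perf_counter() - self._epoch_start
        self.epochs += 1
        self._epoch_start = None


timer = EpochTimer()
for _ in range(3):                    # stand-in for three training epochs
    timer.on_epoch_begin(model=None)
    time.sleep(0.01)                  # stand-in for the actual training work
    timer.on_epoch_end(model=None)
print(f'{timer.epochs} epochs in {timer.total_seconds:.3f} s')
```

Timing only the span between the two hooks leaves out any work done between epochs, which is one reason to accumulate per-epoch durations rather than subtract a single start time from a single end time.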

pine.util module¶

pine.util.download_to(url: str, path: pathlib.Path, size: Optional[int] = None, transformation: Optional[Callable[[str], str]] = None, extract_file: Optional[pathlib.Path] = None, buffer_size: int = 1048576)¶

pine.util.interpolate(X: numpy.ndarray, Y: numpy.ndarray, kind: Optional[str] = None) → Tuple[numpy.ndarray, numpy.ndarray]¶

pine.util.parallel_simple_preprocess(pool, path: pathlib.Path, semaphore) → Iterable[List[str]]¶

pine.util.produce(iterable: Iterable[pine.util.T], semaphore) → Iterable[pine.util.T]¶

pine.util.simple_preprocess(document: str) → List[str]¶

pine.util.stringify_parameters(parameters: Dict) → str¶

pine.util.unzip_to(archive: pathlib.Path, result_dir: pathlib.Path, unlink_after: bool = False)¶

Module contents¶

class pine.LanguageModel(corpus: str, workspace: Union[pathlib.Path, str] = '.', language: str = 'en', subwords: bool = True, positions: Union[bool, str] = 'constrained', use_vocab_from: Optional[pine.language_model.LanguageModel] = None, friendly_name: Optional[str] = None, extra_fasttext_parameters: Optional[Dict] = None)¶

Bases: object

A log-bilinear language model.

The parameters, variables, references, properties, and methods of this class are documented under pine.language_model.LanguageModel above.
