pine package¶
Subpackages¶
- pine.corpus package
- pine.qualitative_evaluation package
- pine.quantitative_evaluation package
  - Subpackages
    - pine.quantitative_evaluation.language_modeling package
      - Submodules
        - pine.quantitative_evaluation.language_modeling.data module
        - pine.quantitative_evaluation.language_modeling.language_modeling module
        - pine.quantitative_evaluation.language_modeling.model module
        - pine.quantitative_evaluation.language_modeling.training module
        - pine.quantitative_evaluation.language_modeling.view module
      - Module contents
    - pine.quantitative_evaluation.text_classification package
  - Submodules
    - pine.quantitative_evaluation.word_analogy module
  - Module contents
Submodules¶
pine.configuration module¶
- pine.configuration.MODEL_BASENAMES(subwords, positions)¶
- pine.configuration.MODEL_FRIENDLY_NAMES(subwords, positions)¶
pine.language_model module¶
- class pine.language_model.LanguageModel(corpus: Union[str, Iterable[Iterable[str]]], workspace: Union[pathlib.Path, str] = '.', language: str = 'en', subwords: bool = True, positions: Union[bool, str] = 'constrained', use_vocab_from: Optional[pine.language_model.LanguageModel] = None, friendly_name: Optional[str] = None, extra_fasttext_parameters: Optional[Dict] = None)¶
Bases: object
A log-bilinear language model.
- Parameters
corpus ({str, Corpus}) – The name of the corpus on which the language model will be trained. See get_corpus() for a list of available corpora for a given language and their names. Alternatively, a custom corpus given as an iterable of tokenized sentences (Iterable[Iterable[str]]).
workspace ({Path, str}, optional) – The path to a workspace in which corpora, models, evaluation results, and cached data will be stored. The workspace ensures that when you later reinitialize a model with the same parameters, the model does not need to be retrained and all available data are loaded from the workspace. This is convenient for experiments but less so for production use. For production, use the low-level gensim.models.fasttext.FastText class instead of LanguageModel. The default workspace is the current working directory (.).
language (str, optional) – The language of the model. Determines the corpora and the evaluation tasks available for the model. The default language is English (en).
subwords (bool, optional) – Whether the model will use subwords as well as words during the training. Using subwords is known to improve the speed of convergence for log-bilinear models, especially for inflected natural languages. [bojanowski2017enriching] However, subwords may not exist in domains outside natural language processing, where corpora can be sequences of arbitrary events. By default, the model will use subwords (True).
positions ({bool,str}, optional) – Whether the model will use positions of context words relative to the masked input word during the training. Using positions is known to reduce the perplexity of log-bilinear models. [mikolov2018advances] The options are to not use positions (False), to use full dimensionality for positions (full), or to use reduced dimensionality for positions (constrained). Using reduced dimensionality for positions has been shown to improve the speed of training and also the speed of convergence for log-bilinear models. [novotny2022when] By default, the model will use reduced dimensionality for positions (constrained).
use_vocab_from ({LanguageModel, NoneType}, optional) – Another trained log-bilinear language model to borrow corpus statistics (vocab) from to speed up the training. The other model must have been trained on the same corpus in the same language as this model. By default, we will do a preliminary pass over the corpus to retrieve the vocab (None).
friendly_name ({str, NoneType}, optional) – Your own name for the log-bilinear language model, which will be used in outputs and in visualizations. The name will not affect the workspace, i.e. initializing a model with a different friendly name does not mean that you need to retrain the model. By default, the friendly name of the model will be automatically generated according to its parameters (None).
extra_fasttext_parameters ({dict, NoneType}, optional) – Extra parameters for the log-bilinear language model that override the defaults from FASTTEXT_PARAMETERS. These parameters are passed to the gensim.models.fasttext.FastText class constructor. By default, no extra parameters will be used (None).
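The way extra_fasttext_parameters overrides the defaults can be pictured as a plain dictionary merge. The following is a minimal sketch under stated assumptions, not the library's code: the default values shown here and the helper merge_fasttext_parameters are hypothetical, and only the keyword names (vector_size, window, epochs) are taken from the gensim FastText constructor.

```python
# Hypothetical defaults for illustration only; the actual contents of
# FASTTEXT_PARAMETERS are not listed in this reference.
FASTTEXT_PARAMETERS = {"vector_size": 300, "window": 5, "epochs": 10}


def merge_fasttext_parameters(extra_fasttext_parameters=None):
    """Return a copy of the defaults in which extra parameters take precedence."""
    merged = dict(FASTTEXT_PARAMETERS)  # copy, so the defaults stay untouched
    merged.update(extra_fasttext_parameters or {})
    return merged


# Only the window size is overridden; the other defaults are kept.
full_parameters = merge_fasttext_parameters({"window": 15})
print(full_parameters)
```

The merged dictionary corresponds to what the fasttext_parameters variable describes: the full set of parameters for the FastText constructor.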
- Variables
corpus (Corpus) – The corpus on which the language model will be trained.
workspace (Path) – The path to a workspace in which corpora, models, evaluation results, and cached data will be stored. The workspace ensures that when you later reinitialize a model with the same parameters, the model does not need to be retrained and all available data are loaded from the workspace. This is convenient for experiments but less so for production use. For production, use the low-level gensim.models.fasttext.FastText class instead of LanguageModel.
language (str) – The language of the model. Determines the corpora and the evaluation tasks available for the model.
subwords (bool) – Whether the model will use subwords as well as words during the training. Using subwords is known to improve the speed of convergence for log-bilinear models, especially for inflected natural languages. [bojanowski2017enriching] However, subwords may not exist in domains outside natural language processing, where corpora can be sequences of arbitrary events.
positions ({bool,str}) – Whether the model will use positions of context words relative to the masked input word during the training. Using positions is known to reduce the perplexity of log-bilinear models. [mikolov2018advances] The options are to not use positions (False), to use full dimensionality for positions (full), or to use reduced dimensionality for positions (constrained). Using reduced dimensionality for positions has been shown to improve the speed of training and also the speed of convergence for log-bilinear models. [novotny2022when]
friendly_name (str) – Your own name for the log-bilinear language model, which will be used in outputs and in visualizations. The name will not affect the workspace, i.e. initializing a model with a different friendly name does not mean that you need to retrain the model.
fasttext_parameters (dict) – The full parameters of the log-bilinear language model for the gensim.models.fasttext.FastText class constructor.
model (gensim.models.fasttext.FastText) – The raw log-bilinear language model.
vectors (gensim.models.keyedvectors.KeyedVectors) – The word, subword, and positional embeddings of the log-bilinear language model.
training_duration (float) – The training duration of the model in seconds.
input_vectors (np.ndarray) – The input word vectors of the log-bilinear language model.
output_vectors (np.ndarray) – The output word vectors of the log-bilinear language model.
positional_vectors (np.ndarray) – The input positional vectors of the log-bilinear language model.
position_importance (PositionImportance) – The importance of positions. [novotny2022when]
positional_feature_clusters (ClusteredPositionalFeatures) – Clusters of positional features. [novotny2022when]
words (sequence of str) – All words in the dictionary of the log-bilinear model.
classified_context_words (dict of (str, str)) – All words in the dictionary of the log-bilinear model, classified to the individual clusters of positional features. [novotny2022when]
word_analogy (WordAnalogyResult) – The results of the log-bilinear language model on the word analogy task of Mikolov et al. (2013) [mikolov2013efficient].
language_modeling (LanguageModelingResult) – The results of the log-bilinear language model on the language modeling task of Novotný et al. (2021) [novotny2022when].
model_files (iterable of (Path, int)) – The individual files of the log-bilinear language model with their sizes in bytes.
cache_files (iterable of (Path, int)) – The individual cached data of the log-bilinear language model with their sizes in bytes.
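The workspace behavior described above (same parameters, no retraining; a different friendly name, no retraining either) can be illustrated with a small cache keyed by the training parameters. This is only a sketch under assumptions: the file layout, the helper names model_basename and load_or_train, and the text file standing in for a trained model are all hypothetical and do not describe how the package stores its data.

```python
import tempfile
from pathlib import Path


def model_basename(language, subwords, positions):
    """Build a file name from the training parameters only.

    The friendly name is deliberately left out, so renaming a model
    does not invalidate the cached result.
    """
    return f"{language}-subwords_{subwords}-positions_{positions}"


def load_or_train(workspace, language, subwords, positions, train):
    """Return (model, was_trained); reuse the cached file when it exists."""
    model_dir = Path(workspace) / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / (model_basename(language, subwords, positions) + ".txt")
    if path.exists():
        return path.read_text(encoding="utf-8"), False
    model = train()  # stands in for the expensive training step
    path.write_text(model, encoding="utf-8")
    return model, True


with tempfile.TemporaryDirectory() as workspace:
    first = load_or_train(workspace, "en", True, "constrained", lambda: "trained model")
    second = load_or_train(workspace, "en", True, "constrained", lambda: "trained model")
    print(first[1], second[1])  # the second call reuses the cached file
```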
References
The general log-bilinear language model was developed by Mikolov et al. (2013) [mikolov2013efficient]. The subword model was developed by Bojanowski et al. (2017) [bojanowski2017enriching]. The positional model was developed by Mikolov et al. (2018) [mikolov2018advances]. The constrained positional model was developed by Novotný et al. (2021) [novotny2022when].
- mikolov2013efficient
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://arxiv.org/abs/1301.3781v3
- bojanowski2017enriching
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146. https://arxiv.org/pdf/1607.04606.pdf
- mikolov2018advances
Mikolov, T., et al. “Advances in Pre-Training Distributed Word Representations.” Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). 2018. http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf
- novotny2022when
Novotný, V., et al. “When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting”. 2022. https://arxiv.org/abs/2104.09691v5
- property basename: str¶
- property cache_dir: pathlib.Path¶
- property cache_files: Iterable[Tuple[pathlib.Path, int]]¶
- property classified_context_words: Dict[str, str]¶
- classify_context_word(word: str) → str¶
Classify a context word to a cluster of positional features. [novotny2022when]
- Parameters
word (str) – A context word.
- Returns
cluster_label – A label of a cluster of positional features to which the context word has been classified.
- Return type
str
- property corpus: Iterable[Iterable[str]]¶
- property corpus_dir: pathlib.Path¶
- property dataset_dir: pathlib.Path¶
- property fasttext_parameters: Dict¶
- property friendly_name: str¶
- get_masked_word_probability(sentence: Sentence, masked_word: str, cluster_label: Optional[str] = None) → SentenceProbability¶
Get the probability of a sentence given a masked word.
- Parameters
sentence (Sentence) – A sentence.
masked_word (str) – A masked word.
cluster_label ({str, NoneType}, optional) – The cluster of positional features [novotny2022when] to use for the prediction. By default, we will use all features (None).
- Returns
sentence_probability – The probability of the sentence given the masked word.
- Return type
SentenceProbability
- property input_vectors: numpy.ndarray¶
- property language_modeling: LanguageModelingResult¶
- property model: gensim.models.fasttext.FastText¶
- property model_dir: pathlib.Path¶
- property model_files: Iterable[Tuple[pathlib.Path, int]]¶
- property output_vectors: numpy.ndarray¶
- property position_importance: PositionImportance¶
- property positional_feature_clusters: ClusteredPositionalFeatures¶
- property positional_vectors: numpy.ndarray¶
- predict_masked_words(sentence: Sentence) → Iterable[str]¶
Predict masked words for a sentence.
- Parameters
sentence (Sentence) – A sentence.
- Returns
masked_words – The predicted masked words for the sentence in descending order of probability.
- Return type
iterable of str
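To make the idea of masked word prediction with positions concrete, here is a toy sketch of a log-bilinear predictor in plain Python. It is an illustration under assumptions, not this method's implementation: the function rank_masked_words, the two-dimensional vectors, and the use of a full softmax are all hypothetical simplifications. The positional variant follows the idea in [mikolov2018advances] of multiplying each context word vector elementwise by a vector for its relative position before averaging.

```python
import math


def rank_masked_words(context, input_vectors, output_vectors, positional_vectors=None):
    """Rank candidate masked words by probability, most probable first.

    context is a list of (word, relative_position) pairs around the mask.
    """
    dim = len(next(iter(output_vectors.values())))
    hidden = [0.0] * dim
    for word, position in context:
        vector = input_vectors[word]
        if positional_vectors is not None:
            # Positional weighting: elementwise product with the position vector.
            vector = [v * p for v, p in zip(vector, positional_vectors[position])]
        hidden = [h + v for h, v in zip(hidden, vector)]
    hidden = [h / len(context) for h in hidden]  # average over the context

    # Score each candidate by a dot product, then normalize with a softmax.
    scores = {w: sum(h * o for h, o in zip(hidden, out)) for w, out in output_vectors.items()}
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return sorted(((w, e / total) for w, e in exps.items()), key=lambda item: item[1], reverse=True)


# Toy vectors, chosen by hand for illustration.
input_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
output_vectors = {"sat": [0.2, 1.0], "a": [1.0, 0.1], "dog": [0.1, 0.5]}
context = [("the", -2), ("cat", -1)]

without_positions = rank_masked_words(context, input_vectors, output_vectors)
with_positions = rank_masked_words(
    context, input_vectors, output_vectors,
    positional_vectors={-2: [0.1, 0.1], -1: [1.0, 1.0]},
)
print([w for w, _ in without_positions])
print([w for w, _ in with_positions])
```

In this toy setup, down-weighting the more distant context word changes the order of the lower-ranked candidates, which is the kind of effect positional weighting is meant to capture.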
- print_files()¶
Pretty-print the individual files and cached data of the log-bilinear language model on the standard output.
- produce_example_sentences(cluster_label: str) → ExampleSentences¶
Produce two example sentences that characterize a cluster of positional features. [novotny2022when]
A context word from a cluster of positional features will be placed at two different positions in a sentence, chosen so that it produces the greatest difference in masked word predictions. This illustrates the behavior and the purpose of a cluster of positional features.
- Parameters
cluster_label (str) – A label of a cluster of positional features.
- Returns
example_sentences – Two example sentences that characterize the cluster of positional features.
- Return type
ExampleSentences
- property training_duration: float¶
- property vectors: gensim.models.keyedvectors.KeyedVectors¶
- property word_analogy: WordAnalogyResult¶
- property words: Sequence[str]¶
- class pine.language_model.TrainingDurationMeasure¶
Bases: gensim.models.callbacks.CallbackAny2Vec
- on_epoch_begin(model)¶
Method called at the start of each epoch.
- Parameters
model (Word2Vec or subclass) – Current model.
- on_epoch_end(model)¶
Method called at the end of each epoch.
- Parameters
model (Word2Vec or subclass) – Current model.
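A callback with these two hooks can accumulate training time by starting a timer at the beginning of each epoch and adding the elapsed time at the end. The sketch below shows that pattern with a standalone class so that it runs without gensim; the class name TrainingDurationSketch and its total_seconds attribute are hypothetical, and the documented class derives from gensim.models.callbacks.CallbackAny2Vec instead.

```python
import time


class TrainingDurationSketch:
    """Accumulate wall-clock time spent between epoch begin and epoch end."""

    def __init__(self):
        self.total_seconds = 0.0
        self._epoch_started_at = None

    def on_epoch_begin(self, model):
        # Remember when the epoch started.
        self._epoch_started_at = time.perf_counter()

    def on_epoch_end(self, model):
        # Ignore an end without a matching begin.
        if self._epoch_started_at is None:
            return
        self.total_seconds += time.perf_counter() - self._epoch_started_at
        self._epoch_started_at = None


# Simulate three short epochs; a real trainer would call the hooks itself.
measure = TrainingDurationSketch()
for _ in range(3):
    measure.on_epoch_begin(model=None)
    time.sleep(0.01)
    measure.on_epoch_end(model=None)
print(f"{measure.total_seconds:.3f} s")
```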
pine.util module¶
- pine.util.download_to(url: str, path: pathlib.Path, size: Optional[int] = None, transformation: Optional[Callable[[str], str]] = None, extract_file: Optional[pathlib.Path] = None, buffer_size: int = 1048576)¶
- pine.util.interpolate(X: numpy.ndarray, Y: numpy.ndarray, kind: Optional[str] = None) → Tuple[numpy.ndarray, numpy.ndarray]¶
- pine.util.parallel_simple_preprocess(pool, path: pathlib.Path, semaphore) → Iterable[List[str]]¶
- pine.util.produce(iterable: Iterable[pine.util.T], semaphore) → Iterable[pine.util.T]¶
- pine.util.simple_preprocess(document: str) → List[str]¶
- pine.util.stringify_parameters(parameters: Dict) → str¶
- pine.util.unzip_to(archive: pathlib.Path, result_dir: pathlib.Path, unlink_after: bool = False)¶
Module contents¶
- class pine.LanguageModel(corpus: Union[str, Iterable[Iterable[str]]], workspace: Union[pathlib.Path, str] = '.', language: str = 'en', subwords: bool = True, positions: Union[bool, str] = 'constrained', use_vocab_from: Optional[pine.language_model.LanguageModel] = None, friendly_name: Optional[str] = None, extra_fasttext_parameters: Optional[Dict] = None)¶
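Bases: object
A log-bilinear language model. This is the same class as pine.language_model.LanguageModel; its parameters, variables, references, and members are documented in full under the pine.language_model module above.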