pine.corpus package
Submodules
pine.corpus.common_crawl module
- class pine.corpus.common_crawl.CommonCrawlSentences(desc: str, semaphore, corpus_dir: pathlib.Path, language: str)
Bases: object
- pine.corpus.common_crawl.get_corpus_path(language: str, name: str, corpus_dir: pathlib.Path) → pathlib.Path
pine.corpus.corpus module
- class pine.corpus.corpus.LineSentence(name: str, path: pathlib.Path, size: int)
Bases: Iterable
- pine.corpus.corpus.get_corpus(name: str, corpus_dir: pathlib.Path, language: str = 'en') → Iterable[Iterable[str]]
Produces a given corpus in a given language.
- Parameters
name (str) – The name of the corpus. Known corpus names are wikipedia, which corresponds to Wikipedia, and common_crawl, which corresponds to the Deduplicated Common Crawl.
corpus_dir (Path) – The directory in which the corpus will be stored.
language (str) – The language of the corpus. Defaults to 'en'.
- Returns
corpus – The given corpus in the given language.
- Return type
Iterable[Iterable[str]]
pine.corpus.wikipedia module
- class pine.corpus.wikipedia.EnglishWikipediaSentences(desc: str, semaphore, percentage: float = 1.0)
Bases: object
- pine.corpus.wikipedia.get_corpus_path(language: str, name: str, corpus_dir: pathlib.Path) → pathlib.Path
Module contents
- pine.corpus.get_corpus(name: str, corpus_dir: pathlib.Path, language: str = 'en') → Iterable[Iterable[str]]
Produces a given corpus in a given language.
- Parameters
name (str) – The name of the corpus. Known corpus names are wikipedia, which corresponds to Wikipedia, and common_crawl, which corresponds to the Deduplicated Common Crawl.
corpus_dir (Path) – The directory in which the corpus will be stored.
language (str) – The language of the corpus. Defaults to 'en'.
- Returns
corpus – The given corpus in the given language.
- Return type
Iterable[Iterable[str]]
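Example
The following is a minimal sketch of how a value with the documented Iterable[Iterable[str]] shape might be consumed. It assumes, without confirmation from the source, that each inner iterable is one sentence given as a sequence of tokens. The helper count_tokens, the toy data, and the corpora directory name are hypothetical and not part of pine; the get_corpus call is shown only in a comment so that the snippet runs without the package installed.

```python
from typing import Iterable


def count_tokens(corpus: Iterable[Iterable[str]]) -> int:
    """Count the tokens in a corpus given as an iterable of sentences.

    Hypothetical helper: assumes each sentence is an iterable of tokens.
    """
    return sum(len(list(sentence)) for sentence in corpus)


# Stand-in data with the same shape as the documented return type.
toy_corpus = [["hello", "world"], ["pine", "corpus", "example"]]
total = count_tokens(toy_corpus)

# With pine installed, a real corpus could presumably be passed in the same
# way (the directory name is a placeholder):
#
#     from pathlib import Path
#     from pine.corpus import get_corpus
#     corpus = get_corpus("wikipedia", Path("corpora"), language="en")
#     total = count_tokens(corpus)
```

Because the helper only iterates, it would work equally well on a lazily streamed corpus, at the cost of reading it once from start to end.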