pine.corpus package

Submodules

pine.corpus.common_crawl module

class pine.corpus.common_crawl.CommonCrawlSentences(desc: str, semaphore, corpus_dir: pathlib.Path, language: str)

Bases: object

pine.corpus.common_crawl.get_corpus_path(language: str, name: str, corpus_dir: pathlib.Path) → pathlib.Path

pine.corpus.corpus module

class pine.corpus.corpus.LineSentence(name: str, path: pathlib.Path, size: int)

Bases: Iterable

pine.corpus.corpus.get_corpus(name: str, corpus_dir: pathlib.Path, language: str = 'en') → Iterable[Iterable[str]]

Produces a given corpus in a given language.

Parameters
  • name (str) – The name of the corpus. Known corpus names are wikipedia, which corresponds to Wikipedia, and common_crawl, which corresponds to the deduplicated Common Crawl.

  • corpus_dir (Path) – The directory in which the corpus will be stored.

  • language (str) – The language of the corpus. Defaults to 'en'.

Returns

corpus – The given corpus in the given language.

Return type

Corpus
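
The return annotation Iterable[Iterable[str]] describes a corpus as an iterable of sentences, each of which is an iterable of tokens. The following is a minimal sketch of that shape and of how a consumer might iterate it. ToyLineCorpus and count_tokens are hypothetical names introduced for illustration only; they are not part of pine, and the one-sentence-per-line, whitespace-tokenized file layout is an assumption rather than a description of how LineSentence is implemented.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List


class ToyLineCorpus:
    """Hypothetical stand-in (NOT part of pine).

    Assumes one sentence per line, with tokens separated by whitespace.
    """

    def __init__(self, path: Path) -> None:
        self.path = path

    def __iter__(self) -> Iterator[List[str]]:
        # Reopen the file on every pass so the corpus can be iterated
        # more than once, unlike a one-shot generator.
        with self.path.open("rt", encoding="utf-8") as f:
            for line in f:
                yield line.split()


def count_tokens(corpus: Iterable[Iterable[str]]) -> int:
    """Count tokens in anything shaped like Iterable[Iterable[str]]."""
    return sum(len(list(sentence)) for sentence in corpus)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.txt"
    path.write_text(
        "the quick brown fox\njumps over the lazy dog\n", encoding="utf-8"
    )
    corpus = ToyLineCorpus(path)
    first_pass = count_tokens(corpus)   # first iteration over the file
    second_pass = count_tokens(corpus)  # a second iteration still works
```

With pine installed, the stand-in would presumably be replaced by a call that follows the documented signature, such as get_corpus('wikipedia', Path('corpora'), language='en'). Whether the returned corpus can be iterated more than once is not stated here and should be checked before relying on it.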

pine.corpus.wikipedia module

class pine.corpus.wikipedia.EnglishWikipediaSentences(desc: str, semaphore, percentage: float = 1.0)

Bases: object

pine.corpus.wikipedia.get_corpus_path(language: str, name: str, corpus_dir: pathlib.Path) → pathlib.Path

Module contents

pine.corpus.get_corpus(name: str, corpus_dir: pathlib.Path, language: str = 'en') → Iterable[Iterable[str]]

Produces a given corpus in a given language.

Parameters
  • name (str) – The name of the corpus. Known corpus names are wikipedia, which corresponds to Wikipedia, and common_crawl, which corresponds to the deduplicated Common Crawl.

  • corpus_dir (Path) – The directory in which the corpus will be stored.

  • language (str) – The language of the corpus. Defaults to 'en'.

Returns

corpus – The given corpus in the given language.

Return type

Corpus