Metadata-Version: 2.1
Name: PgsFile
Version: 0.2.1
Summary: This module simplifies Python package management, script execution, file handling, web scraping, multimedia download, data cleaning, NLP tasks, and word list generation for students of literature, making these tasks more accessible and convenient.
Home-page: https://mp.weixin.qq.com/s/12-KVLfaPszoZkCxuRd-nQ?token=1589547443&lang=zh_CN
Author: Pan Guisheng
Author-email: 895284504@qq.com
License: Educational free
Classifier: Programming Language :: Python :: 3
Classifier: License :: Free For Educational Use
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: python-docx
Requires-Dist: pip
Requires-Dist: requests
Requires-Dist: fake-useragent
Requires-Dist: lxml
Requires-Dist: pimht
Requires-Dist: pysbd
Requires-Dist: nlpir-python

Purpose: This module assists Python beginners, particularly instructors and students of foreign languages and literature, by providing a convenient way to manage Python packages, run Python scripts, and operate on various file types such as txt, xlsx, json, tsv, html, mhtml, and docx. It also includes functionality for data scraping, data cleaning, and word list generation.


Function 1: Reads data from files and writes data to files with a single line of code.
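
To illustrate the kind of one-line read and write this refers to, here is a minimal sketch built only on the Python standard library. The helper names `write_data` and `read_data` are hypothetical and are not PgsFile's own functions; the sketch covers only txt and json files.

```python
import json
from pathlib import Path


def write_data(path, data, encoding="utf-8"):
    """Store data in a file; .json files are serialized, anything else is written as text."""
    path = Path(path)
    if path.suffix.lower() == ".json":
        # ensure_ascii=False keeps Chinese characters readable in the output file
        path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding=encoding)
    else:
        path.write_text(str(data), encoding=encoding)


def read_data(path, encoding="utf-8"):
    """Load data from a file; .json files are parsed, anything else is returned as text."""
    path = Path(path)
    text = path.read_text(encoding=encoding)
    return json.loads(text) if path.suffix.lower() == ".json" else text
```

With helpers like these, each read or write becomes a single call, for example `write_data("notes.txt", "hello")` followed by `read_data("notes.txt")`.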

Function 2: Retrieves all absolute file paths and file names in any folder (including sub-folders) with a single line of code, using the "FilePath" and "FileName" functions.
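
The recursive listing described here can be sketched with `os.walk` from the standard library. This only illustrates the idea; it is not the implementation of "FilePath" or "FileName", and the names `list_file_paths` and `list_file_names` are hypothetical.

```python
import os


def list_file_paths(folder):
    """Return the sorted absolute paths of every file under folder, sub-folders included."""
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            paths.append(os.path.abspath(os.path.join(root, name)))
    return sorted(paths)


def list_file_names(folder):
    """Return the bare file names (no directory part) of every file under folder."""
    return [os.path.basename(path) for path in list_file_paths(folder)]
```

Sorting the result makes the order reproducible across runs, since `os.walk` does not guarantee one.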

Function 3: Creates frequency-sorted word lists from a single file or a batch of files, using the "word_list" and "batch_word_list" functions in PgsFile.
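
As a rough illustration of what a frequency-sorted word list involves, the sketch below counts tokens with `collections.Counter`. It is not the code behind "word_list" or "batch_word_list"; the function names are hypothetical, and the regular expression is an assumption that handles English words only (Chinese text would need a word segmenter first).

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, optionally with one internal apostrophe.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def word_frequencies(text):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    counts = Counter(token.lower() for token in TOKEN.findall(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def batch_word_frequencies(paths, encoding="utf-8"):
    """Merge the word counts of several text files into one sorted list."""
    total = Counter()
    for path in paths:
        with open(path, encoding=encoding) as handle:
            total.update(dict(word_frequencies(handle.read())))
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))
```

Breaking ties alphabetically keeps the list stable when several words share the same frequency.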

Function 4: Pgs-Corpora is a comprehensive language resource included in this library, featuring a monolingual corpus of native and translational Chinese and native and non-native English, as well as a bi-directional parallel corpus of Chinese and English texts covering financial, legal, political, academic, and sports news topics. Additionally, the library includes a collection of 8774 English idioms, stopwords for 28 languages, and a termbank of Chinese thought and culture.

Function 5: This library provides support for common text cleaning tasks, such as removing empty text, empty lines, and folders containing empty text. It also offers functions for converting full-width characters to half-width characters and vice versa, as well as standardizing the format of Chinese and English punctuation. These features can help improve the quality and consistency of text data used in natural language processing tasks.
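
The width conversion mentioned above relies on a fixed offset in Unicode: the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts U+0021 to U+007E, and the ideographic space U+3000 corresponds to the ordinary space. The following sketch shows that mapping together with empty-line removal. The function names are hypothetical and do not reflect the library's own API.

```python
FULLWIDTH_OFFSET = 0xFEE0  # distance between "!" (U+0021) and its full-width form (U+FF01)


def to_half_width(text):
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width "!" through "~"
            out.append(chr(code - FULLWIDTH_OFFSET))
        else:                             # Chinese characters etc. pass through unchanged
            out.append(ch)
    return "".join(out)


def to_full_width(text):
    """Convert printable ASCII and the ordinary space to their full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x20:
            out.append("\u3000")
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def drop_empty_lines(text):
    """Remove lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())
```

Note that this sketch converts every full-width punctuation mark in the ASCII range; a cleaner for mixed Chinese and English text may want to keep full-width punctuation inside Chinese sentences.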

Function 6: Manages Python package installation and uninstallation, and allows scripts and commands to be run from the Python interactive shell instead of the Windows command prompt.
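
One common way to install or uninstall packages from inside a Python session is to call pip as a subprocess of the running interpreter. The sketch below shows that approach; it is an assumption about how such a feature can be built, not a description of PgsFile's internals, and `pip_command` and `run_pip` are hypothetical names.

```python
import subprocess
import sys


def pip_command(action, package):
    """Build the pip command line for 'install' or 'uninstall' without running it."""
    if action not in ("install", "uninstall"):
        raise ValueError("action must be 'install' or 'uninstall'")
    # sys.executable ensures pip targets the interpreter that is currently running
    command = [sys.executable, "-m", "pip", action]
    if action == "uninstall":
        command.append("-y")  # skip the interactive confirmation prompt
    command.append(package)
    return command


def run_pip(action, package):
    """Run pip and return its exit code (0 means success)."""
    return subprocess.run(pip_command(action, package)).returncode
```

For example, `run_pip("install", "requests")` typed at the interactive prompt has the same effect as `python -m pip install requests` in a terminal.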

Function 7: Downloads audiovisual files such as videos, images, and audio with audiovisual_downloader, and scrapes newspaper data with PGScraper.

Table 1: Directory structure and size of Pgs-Corpora (each entry shows item count and total size)
├── Idioms (1, 171.78 KB)
├── Monolingual (2197, 63.65 MB)
│   ├── Chinese (456, 15.27 MB)
│   │   ├── People's Daily 20130605 (396, 1.38 MB)
│   │   │   ├── Raw (132, 261.73 KB)
│   │   │   ├── Seg_only (132, 471.47 KB)
│   │   │   └── Tagged (132, 675.30 KB)
│   │   └── Translational Fictions (60, 13.89 MB)
│   └── English (1741, 48.38 MB)
│       ├── Native (65, 44.14 MB)
│       │   ├── A Short Collection of British Fiction (27, 33.90 MB)
│       │   └── Preschoolers- and Teenagers-oriented Texts in English (36, 10.24 MB)
│       ├── Non-native (1675, 3.63 MB)
│       │   └── Shanghai Daily (1675, 3.63 MB)
│       │       └── Business_2019 (1675, 3.63 MB)
│       │           ├── 2019-01-01 (1, 3.35 KB)
│       │           ├── 2019-01-02 (1, 3.65 KB)
│       │           ├── 2019-01-03 (7, 10.90 KB)
│       │           ├── 2019-01-04 (5, 9.63 KB)
│       │           ├── 2019-01-07 (4, 9.50 KB)
│       │           └── ... (and 245 more directories)
│       └── Translational (1, 622.57 KB)
├── Parallel (371, 24.67 MB)
│   ├── HK Financial and Legal EC Parallel Corpora (5, 19.17 MB)
│   ├── New Year Address_CE_2006-2021 (15, 147.49 KB)
│   ├── Sports News_CE_2010 (20, 66.42 KB)
│   ├── TED_EC_2017-2020 (330, 5.24 MB)
│   └── Xi's Speech_CE_2021 (1, 53.01 KB)
├── Stopwords (28, 88.09 KB)
└── Terminology (1, 2.20 MB)

...


Author: Pan Guisheng, a PhD student at the Graduate Institute of Interpretation and Translation of Shanghai International Studies University
E-mail: 895284504@qq.com
