funNLP Chinese/English NLP Toolkit and Resource Library
Set up and use fighting41love/funNLP, a comprehensive collection of Chinese NLP resources with 80,000+ GitHub stars, covering sensitive word detection, language detection, datasets, and tools.
- Step 1
What is funNLP?
funNLP (fighting41love/funNLP) is one of the most comprehensive Chinese NLP resource collections on GitHub with over 80,000 stars. Unlike a traditional package, it's a curated repository that links to external NLP tools, datasets, and resources. It covers sensitive word detection, language detection, phone carrier lookup, name gender inference, email/ID extraction, Chinese/Japanese name databases, vocabulary sentiment values, stop words, and much more.
- Step 2
Technology Stack
The funNLP project collects resources across multiple technologies:
Primary Language: Python (365 files in repository)
Categories Covered:
- Text Processing: jieba, HanLP, THULAC, LTP, NLTK, spaCy
- Deep Learning Frameworks: PyTorch, TensorFlow, Keras
- Pre-trained Models: BERT, RoBERTa, ALBERT, GPT-2, XLNet, ELECTRA, LLaMA
- NLP Tasks: Named Entity Recognition (NER), Text Classification, Sentiment Analysis, Text Summarization, Machine Translation, Question Answering, Text Generation
- Knowledge Graphs: Neo4j, AllegroGraph, Jena
- OCR: cnocr, PaddleOCR, Tesseract
- Speech Recognition: ASR datasets and tools
- Visualization: Matplotlib, Seaborn, Plotly
Data Formats: CSV, JSON, TXT, Pickle, HDF5, Parquet
Core Technologies in funNLP:
- Python 3.x (primary language)
- Deep Learning: PyTorch, TensorFlow 2.x, Keras
- NLP Libraries: jieba, HanLP, THULAC, StanfordNLP, spaCy
- Transformers: HuggingFace transformers, tokenizers, datasets
- Data Processing: pandas, numpy, scipy
- Model Serving: FastAPI, Flask, Gradio
- Visualization: matplotlib, seaborn, pyecharts
- Step 3
Repository Structure
funNLP is organized as a README-based resource catalog with supporting data files:
    funNLP/
    ├── README.md                  # Main catalog with categorized links
    ├── .github/                   # GitHub configuration
    └── data/                      # Supporting data files
        ├── .logo图片/              # Logo images
        ├── IT词库/                 # IT terminology dictionary
        ├── 中文分词词库整理/         # Chinese word segmentation dictionaries
        ├── 中文缩写库/              # Chinese abbreviation library
        ├── 中日文名字库/            # Chinese/English/Japanese name databases
        ├── 停用词/                 # Stop words
        ├── 公司名字词库/            # Company name dictionaries
        ├── 动物词库/               # Animal vocabularies
        ├── 医学词库/               # Medical terminology
        ├── 历史名人词库/            # Historical figures
        ├── 古诗词库/               # Ancient poetry
        ├── 同义词库、反义词库/       # Synonyms, antonyms
        ├── 地名词库/               # Place names
        ├── 成语词库/               # Chinese idioms
        ├── 法律词库/               # Legal terminology
        ├── 繁简体转换词库/          # Traditional/Simplified conversion
        ├── 职业词库/               # Occupation vocabulary
        ├── 财经词库/               # Financial terminology
        └── 食物词库/               # Food vocabulary

The main README.md contains categorized links to external repositories, papers, and tools.
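To get an overview of which dictionary folders a local clone actually contains, a small script can walk the data/ directory. This is a minimal sketch, assuming the repository was cloned into a folder named funNLP in the current working directory; the helper name list_dictionary_dirs is illustrative and not part of funNLP.

```python
from pathlib import Path


def list_dictionary_dirs(repo_root):
    """Return (directory name, file count) pairs for each folder under data/."""
    data_dir = Path(repo_root) / "data"
    summary = []
    for entry in sorted(data_dir.iterdir()):
        # Skip loose files and hidden folders such as the logo directory
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        file_count = sum(1 for p in entry.rglob("*") if p.is_file())
        summary.append((entry.name, file_count))
    return summary


if __name__ == "__main__":
    repo = Path("funNLP")  # assumed clone location
    if (repo / "data").is_dir():
        for name, count in list_dictionary_dirs(repo):
            print(f"{name}: {count} files")
```

The folder names and counts printed depend on the state of the clone, so treat the tree above as a guide rather than a guarantee.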
- Step 4
Usage Approach
Since funNLP is a resource catalog rather than a package, you don't install it directly. Instead:
Option 1: Clone for reference
    git clone https://github.com/fighting41love/funNLP.git
    cd funNLP

Option 2: Browse online
Visit https://github.com/fighting41love/funNLP to browse the categorized resource list.
Option 3: Download specific dictionaries
Clone only the specific data files you need using git sparse-checkout, or download individual files from the data/ directory.

    # Clone the repository for offline reference
    git clone https://github.com/fighting41love/funNLP.git

    # Or use sparse checkout for specific directories only
    git clone --filter=blob:none --sparse https://github.com/fighting41love/funNLP.git
    cd funNLP
    git sparse-checkout add "data/停用词" "data/中文分词词库整理"
    git checkout

⚠ Heads up: The repository is a large resource catalog. Consider using sparse checkout if you only need specific dictionaries or data files.
- Step 5
Key Resource Categories
The funNLP README organizes resources into major categories:
Core NLP Tasks:
- Text Classification (文本分类)
- Named Entity Recognition / Information Extraction (命名实体识别/信息抽取)
- Sentiment Analysis (情感分析)
- Text Summarization (文本摘要)
- Text Generation (文本生成)
- Question Answering (智能问答)
- Machine Translation (机器翻译)
- Text Similarity/Matching (文本匹配)
- Spelling Correction (文本纠错)
LLM & ChatGPT Resources:
- ChatGPT-like model benchmarks (类 ChatGPT 模型评测)
- LLM training and inference (LLM 训练_推理)
- Prompt Engineering (提示工程)
- RAG/Dataset for LLMs (LLM 数据集)
- Industry Applications (行业应用)
Data Resources:
- Corpora (语料库): Chinese/English training datasets
- Dictionaries (词库): Stop words, sensitive words, specialized vocabularies
- Knowledge Graphs (知识图谱): Neo4j tutorials, KG construction tools
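A quick way to see the category outline of a local copy of the README is to pull out its Markdown headings. The sketch below assumes the categories appear as ATX headings (lines starting with one to six # characters); the real README may also use HTML or tables, which this does not handle. The function name extract_headings is illustrative.

```python
import re
from pathlib import Path

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def extract_headings(markdown_text):
    """Return (level, title) pairs for ATX headings, skipping fenced code."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


if __name__ == "__main__":
    readme = Path("funNLP/README.md")  # assumed clone location
    if readme.is_file():
        for level, title in extract_headings(readme.read_text(encoding="utf-8")):
            print("  " * (level - 1) + title)
```

Indenting by heading level gives a rough table of contents that can be compared against the categories listed above.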
## funNLP Major Categories (from README):
### LLM & ChatGPT Section
- 类 ChatGPT 的模型评测对比 (Model benchmarks)
- 类 ChatGPT 的资料 (Research papers)
- 类 ChatGPT 的开源框架 (Open source frameworks)
- LLM 的训练_推理_低资源_高效训练 (Training & inference)
- 提示工程 (Prompt Engineering)
- 类 ChatGPT 的文档问答 (RAG/DQA)
- 多模态 LLM (Multimodal LLMs)
### Traditional NLP
- 语料库 (Corpora)
- 词库及词法工具 (Dictionaries & lexicon tools)
- 预训练语言模型 (Pre-trained language models)
- 抽取 (Information extraction)
- 知识图谱 (Knowledge graphs)
- 文本生成 (Text generation)
- 文本摘要 (Text summarization)
- 智能问答 (Question answering)
- 文本纠错 (Spelling correction)
- 文档处理 (Document processing)
- 表格处理 (Table processing)
- 文本匹配 (Text matching)
- 文本分类 (Text classification)
- 情感分析 (Sentiment analysis)
- 机器翻译 (Machine translation)
- 语音处理 (Speech processing)
- Step 6
Accessing the Data Files
The data/ directory contains actual downloadable resources:
Chinese Stop Words (停用词)

    cd funNLP/data/停用词
    # Contains multiple stop word lists: stopWords.txt, CNStopWord.txt, etc.

Chinese Name Databases

    cd funNLP/data/中日文名字库/
    # Contains: Chinese names, Japanese names, English names with gender predictions

Specialized Dictionaries
Medical, legal, financial, and other domain-specific vocabularies.
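Community-collected Chinese word lists are not always saved in the same encoding, so a loader that tries more than one encoding can save some debugging. The sketch below is an assumption about how such files might need to be read, not a statement about how funNLP stores them: it tries UTF-8 (with or without a BOM) first and falls back to GB18030. The helper name load_word_list is illustrative.

```python
from pathlib import Path


def load_word_list(path, encodings=("utf-8-sig", "gb18030")):
    """Read one term per line, trying each encoding until one decodes."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError(f"Could not decode {path} with {encodings}")

    # Drop blank lines and duplicates while keeping the original order
    words, seen = [], set()
    for line in text.splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words
```

A call such as load_word_list('funNLP/data/停用词/stopWords.txt') would return the terms as a list; the file name is taken from the listing above and may differ in your clone.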
    # Example: load stop words from the repository and filter a sentence
    import jieba  # Chinese text has no spaces between words, so segment it first

    # Define path to stop words
    stop_words_path = 'funNLP/data/停用词/Tencent.txt'  # or other stop word files

    with open(stop_words_path, 'r', encoding='utf-8') as f:
        stop_words = set(line.strip() for line in f if line.strip())
    print(f'Loaded {len(stop_words)} stop words')

    # Filter text: segment with jieba, then drop tokens found in the stop word set.
    # (Calling str.split('') would raise ValueError, so an empty separator cannot be used.)
    def remove_stop_words(text, stop_words):
        words = jieba.lcut(text)
        filtered = [word for word in words if word not in stop_words]
        return ''.join(filtered)

    sample = "这是一个测试文本,包含一些停用词"
    print(remove_stop_words(sample, stop_words))

- Step 7
Common Use Cases
Use Case 1: Content Moderation
Use the sensitive word dictionaries to filter inappropriate content in Chinese text.
Use Case 2: NLP Model Training
Find pre-processed datasets and training corpora for various NLP tasks.
Use Case 3: Resource Discovery
Discover new NLP tools and libraries by browsing the categorized link list.
Use Case 4: Dictionary Lookup
Access specialized Chinese dictionaries for domain-specific NLP tasks (medical, legal, financial).
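For Use Case 1, the simplest form of content moderation is plain substring matching against a word list. The sketch below is illustrative only: the word set is a harmless placeholder, the function names are hypothetical, and where the sensitive word lists live in the repository is not assumed here.

```python
def find_sensitive(text, sensitive_words):
    """Return the listed words that occur in text, sorted for stable output."""
    return sorted(word for word in sensitive_words if word and word in text)


def mask_sensitive(text, sensitive_words, mask_char="*"):
    """Replace every occurrence of a listed word with mask characters."""
    # Longest words first so a longer entry is masked as a whole
    for word in sorted(sensitive_words, key=len, reverse=True):
        if word:
            text = text.replace(word, mask_char * len(word))
    return text


if __name__ == "__main__":
    placeholder_words = {"违禁词"}  # stand-in for a real word list
    print(mask_sensitive("这是一个违禁词测试", placeholder_words))
    # 这是一个***测试
```

This loops over every word, which is fine for small lists. For lists with tens of thousands of entries, a trie or Aho-Corasick matcher would likely be a better fit.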
    # Use case: Load multiple dictionaries for domain-specific NLP
    import jieba

    # Load medical terminology
    with open('funNLP/data/医学词库/医学词库.txt', 'r', encoding='utf-8') as f:
        medical_terms = [line.strip() for line in f if line.strip()]

    # Load financial terminology
    with open('funNLP/data/财经词库/财经词库.txt', 'r', encoding='utf-8') as f:
        financial_terms = [line.strip() for line in f if line.strip()]

    print(f'Loaded {len(medical_terms)} medical terms')
    print(f'Loaded {len(financial_terms)} financial terms')

    # Use with jieba to improve segmentation
    jieba.load_userdict('funNLP/data/医学词库/医学词库.txt')
    text = "患者出现头痛和发热症状"
    seg_list = jieba.cut(text)
    print(list(seg_list))

- Step 8
Integrating with External Tools
funNLP links to many external tools. Here are common integration patterns:
Using linked tools (via the README links):
- Browse the README.md to find tools for your specific task
- Follow the link to the external repository
- Install and use those tools according to their documentation
Example tools linked from funNLP:
- jieba: Chinese word segmentation
- HanLP: Complete NLP toolkit for Chinese
- THULAC: Tsinghua University lexical analyzer for Chinese (word segmentation and POS tagging)
- LTP: Language Technology Platform from Harbin Institute of Technology (HIT-SCIR)
- pkuseg: Peking University word segmentation toolkit
- SnowNLP: Simple Chinese text analysis
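Before wiring any of these tools into a project, it can help to check which ones are already importable in the current environment. This is a minimal sketch using only the standard library; it assumes each tool's import name matches the lowercase names shown here, and the helper name installed_tools is illustrative.

```python
from importlib.util import find_spec


def installed_tools(module_names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    status = installed_tools(["jieba", "snownlp", "pkuseg", "thulac"])
    for name, ok in status.items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

The check only confirms that a module can be found; it does not verify versions or that model files the tool needs have been downloaded.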
    # Install popular NLP tools mentioned in funNLP

    # Basic Chinese NLP tools
    pip install jieba          # Chinese word segmentation
    pip install snownlp        # Chinese sentiment analysis
    pip install pkuseg         # Peking University segmenter
    pip install thulac         # Tsinghua segmenter

    # Deep Learning NLP
    pip install transformers   # HuggingFace transformers
    pip install torch          # PyTorch
    pip install spacy          # Industrial-strength NLP
    pip install nltk           # Natural Language Toolkit

    # For model serving
    pip install fastapi        # Fast API for NLP services
    pip install gradio         # Easy NLP model demos

- Step 9
Finding Specific Resources
The funNLP README is organized with Markdown anchors. Use these search strategies:
Search in README: Use grep or GitHub search to find specific entries.
Browse data directories directly: List the data/ directory to find available dictionaries.
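Where grep is not available (for example on a stock Windows shell), the README search strategy can be approximated in Python. This is a rough sketch, assuming the README has been read into a string; search_lines is a hypothetical helper, and the limit of 20 hits is an arbitrary default.

```python
def search_lines(text, keyword, limit=20):
    """Return (line number, line) pairs containing keyword, case-insensitively."""
    needle = keyword.casefold()
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if needle in line.casefold():
            hits.append((number, line.strip()))
            if len(hits) >= limit:
                break
    return hits


if __name__ == "__main__":
    sample = "# BERT models\n情感分析 tools\nbert-base\n"
    for number, line in search_lines(sample, "bert"):
        print(f"{number}: {line}")
    # 1: # BERT models
    # 3: bert-base
```

To search the real catalog, read funNLP/README.md with encoding='utf-8' and pass its contents as text.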
    # Search for specific resources in funNLP
    cd funNLP

    # Find sentiment analysis resources
    grep -i "情感分析" README.md | head -20

    # Find NER resources
    grep -i "命名实体" README.md | head -20

    # Find BERT resources
    grep -i "BERT" README.md | head -20

    # List all available data dictionaries
    ls -la data/

    # Read a specific dictionary file
    head -20 data/停用词/stopWords.txt

- Step 10
Additional Information
Repository Statistics:
- Stars: 80,000+
- Forks: 15,000+
- Primary Language: Python
- Created: August 2018
- Last Updated: May 2024
- Size: ~174 MB
Author: fighting41love (GitHub)
Homepage: https://zhuanlan.zhihu.com/yangyangfuture
License: No explicit license declared - check individual resources for their own licensing.
⚠ Heads up: Each linked resource has its own licensing terms. Check licenses before using in commercial projects. The funNLP repository itself does not have an explicit license file.
- Step 11
Next Steps
After exploring funNLP:
- Identify your NLP task: Browse the categorized links to find tools for your specific use case
- Install required tools: Use pip or follow README installation instructions for specific tools
- Download needed data: Copy relevant dictionaries from the data/ directory to your project
- Explore linked repositories: Follow links to in-depth resources and implementations
- Set up your development environment: Install Python 3.8+, required libraries, and model dependencies
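The "download needed data" step can be scripted so that only the dictionaries a project uses are copied out of the clone. The sketch below assumes a local clone and uses shutil.copytree with dirs_exist_ok, which requires Python 3.8 or later; the function name copy_dictionaries and the target folder are illustrative.

```python
import shutil
from pathlib import Path


def copy_dictionaries(repo_root, project_dir, names):
    """Copy the named data/ subdirectories into project_dir; return copied paths."""
    source_root = Path(repo_root) / "data"
    target_root = Path(project_dir)
    target_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        source = source_root / name
        if not source.is_dir():
            continue  # skip names that are not present in this clone
        destination = target_root / name
        shutil.copytree(source, destination, dirs_exist_ok=True)
        copied.append(destination)
    return copied


if __name__ == "__main__":
    if Path("funNLP/data").is_dir():  # assumed clone location
        for path in copy_dictionaries("funNLP", "my_project/dicts", ["停用词", "医学词库"]):
            print(f"Copied {path}")
```

Names that are missing from the clone are skipped silently, so compare the returned list with the requested names if every dictionary is required.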
Related resources:
- HuggingFace Hub for pre-trained models
- Papers With Code for SOTA benchmarks
- ArXiv for latest NLP research
## Quick Links from funNLP
### Essential Chinese NLP Tools
- [jieba](https://github.com/fxsjy/jieba) - Chinese Word Segmentation
- [HanLP](https://github.com/hankcs/HanLP) - Comprehensive Chinese NLP
- [THULAC](https://github.com/thunlp/THULAC-Python) - Tsinghua Segmenter
- [LTP](https://github.com/HIT-SCIR/ltp) - Harbin Institute of Technology Language Technology Platform
- [pkuseg](https://github.com/lancopku/pkuseg-python) - Peking University Segmenter
- [SnowNLP](https://github.com/isnowfy/snownlp) - Chinese Text Analysis
### Pre-trained Models
- [HuggingFace Transformers](https://github.com/huggingface/transformers)
- [BERT](https://github.com/google-research/bert)
- [RoBERTa](https://github.com/pytorch/fairseq)
- [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm)
### Datasets & Corpora
- [CLUE](https://github.com/CLUEbenchmark/CLUE) - Chinese NLP Benchmark
- [C3](https://github.com/nlpdata/c3) - Multiple-Choice Chinese Machine Reading Comprehension
- [AFQMC](https://github.com/AIFund-2021/AIQMC) - Text Semantic Similarity