"Python in practice": Shocking! Profanity and advertisement detection for a chat system, implemented in Python
2022-01-30 06:25:44 【Coriander Chat Game】
Chat is an almost indispensable feature in games. It brings a problem, though: the world channel can get very chaotic, often filled with sensitive words or chat that game publishers don't want to see. Our game had this problem before, and our company built reporting and back-end monitoring to deal with it. Today let's implement this kind of monitoring.
1、 Requirements analysis：
I'm not good at deep learning. Although I have written about reinforcement learning before, the results were not particularly satisfactory, so here I look at a simpler method.
There are ready-made solutions for this kind of classification task; spam classification, for example, is the same problem. There are various solutions, but I chose the simplest one, naive Bayes classification, mainly as an exploration.
Because most of our games are in Chinese, we need to segment Chinese text into words. For example, a sentence such as "I'm a handsome guy" has to be broken up into separate words.
2、 Algorithm principle：
The naive Bayes algorithm judges the category of a new sample from the conditional probabilities, in the data set, of the features that the new sample has. It assumes that ① the features are independent of one another and ② each feature is equally important. It can also be understood as using past probabilities to estimate how likely a category is when the current features all hold at the same time. You can look up the specific math formulas yourself; the formulas are too hard to write out here, so a rough understanding is enough.
Use the right algorithm at the right time .
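To make the idea concrete, here is a minimal sketch of naive Bayes over word counts, using only the standard library. The class name `TinyNaiveBayes`, the labels and the toy training words are all made up for illustration, and the add-one smoothing is an assumption of this sketch; it is not how nltk implements its classifier.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy naive Bayes classifier over words (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # number of samples per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def add_sample(self, words, label):
        self.label_counts[label] += 1
        for word in words:
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def log_scores(self, words):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_samples = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total_samples)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing so an unseen word does not zero the product.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)


nb = TinyNaiveBayes()
nb.add_sample(["cheap", "gold", "recharge"], "ad")
nb.add_sample(["gold", "discount"], "ad")
nb.add_sample(["idiot", "trash"], "swear")
nb.add_sample(["trash", "noob"], "swear")

print(nb.classify(["cheap", "gold"]))    # ad
print(nb.classify(["trash", "player"]))  # swear
```

The independence assumption shows up as the plain sum of per-word log-probabilities: each word contributes on its own, regardless of the words around it.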
How jieba segmentation works: jieba segmentation is based on a probabilistic language model. The task of segmentation under such a model is: among all possible segmentations of the text, find the segmentation S that maximizes P(S).
jieba ships with its own set of phrases; during segmentation these phrases are treated as the basic units.
Note: I have only briefly introduced the principles of these two technologies. Explaining them fully would take another long article. You can search Baidu, material is everywhere; just find something you can understand. If you can use it, use it first.
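The "find the S that maximizes P(S)" idea can be sketched with a small dynamic program. This is not jieba's code: the dictionary `WORD_FREQ`, its frequencies and the fallback weight for unknown characters are hypothetical, and P(S) is taken to be a simple product of independent word probabilities.

```python
import math

# Hypothetical word frequencies; a real dictionary would be far larger.
WORD_FREQ = {"我": 50, "爱": 30, "北京": 40, "北": 5, "京": 5,
             "天安门": 20, "天": 10, "安": 5, "门": 10}
TOTAL = sum(WORD_FREQ.values())
MAX_WORD_LEN = max(len(w) for w in WORD_FREQ)


def segment(text):
    """Return the segmentation of text with the highest product of word probabilities."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in WORD_FREQ:
                freq = WORD_FREQ[word]
            elif len(word) == 1:
                freq = 0.5  # unknown single character: small fallback weight
            else:
                continue
            score = best[start][0] + math.log(freq / TOTAL)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(" | ".join(segment("我爱北京天安门")))  # 我 | 爱 | 北京 | 天安门
```

Because "北京" and "天安门" are in the dictionary as whole phrases, keeping them together scores higher than splitting them into single characters, which is the effect of using phrases as the basic units.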
3、 Technical analysis
Chinese word segmentation package: the best-known one is jieba. Whether it is the best I don't know, but I figure it is popular for a reason, so use it first. Don't delve into how jieba works for now; give priority to solving the problem, and when you run into trouble, learn from that specific point. That way of learning is the most efficient.
Because I've been working on speech-related things recently, an expert recommended the library nltk. From what I looked up, it is a well-known and very powerful library for language processing. Here I mainly use its classification algorithm, so I don't have to worry about the implementation details or reinvent the wheel. Besides, my own version wouldn't be as good as theirs, so just use it.
Python is really nice: there are packages and ready-made wheels for everything.
Installation commands：
```shell
pip install jieba
pip install nltk
```
Run the two commands above one at a time. Once they finish, the packages are installed and you can happily start testing.
```python
"""
#Author: Coriander
@time: 2021/8/5 0005 Afternoon 10:26
"""
import jieba

if __name__ == '__main__':
    result = " | ".join(jieba.cut(" I love tian 'anmen square in Beijing ,very happy"))
    print(result)
```
Look at the segmentation results: they are very good. Sure enough, a specialized tool does a professional job.
4、 Source code
After these simple tests, we have basically everything we need, so let's start on the code directly.
1、 Load the initial text resources.
2、 Remove punctuation from the text.
3、 Extract features from the text.
4、 Train on the data set to produce a model (that is, the prediction model).
5、 Test it on new input.
```python
#!/usr/bin/env python
# encoding: utf-8
import re

import jieba
from nltk.classify import NaiveBayesClassifier

"""
#Author: Coriander
@time: 2021/8/5 0005 Afternoon 9:29
"""

# Keep only letters, digits and CJK characters; everything else is stripped.
rule = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5]")


def delComa(text):
    """Remove punctuation and whitespace from text."""
    return rule.sub('', text)


def loadData(fileName):
    """Read a UTF-8 text file and return its content as a list of segmented words."""
    with open(fileName, "r", encoding='utf-8') as f:
        text1 = delComa(f.read())
    return list(jieba.cut(text1))


# feature extraction
def word_feats(words):
    """Turn a list of words into the feature dict that nltk expects."""
    return {word: True for word in words}


if __name__ == '__main__':
    adResult = loadData(r"ad.txt")
    yellowResult = loadData(r"yellow.txt")
    # Each word is one training sample. It is wrapped in a list because
    # iterating a bare string would split it into single characters.
    ad_features = [(word_feats([lb]), 'ad') for lb in adResult]
    yellow_features = [(word_feats([df]), 'ye') for df in yellowResult]
    train_set = ad_features + yellow_features
    # train the classifier
    classifier = NaiveBayesClassifier.train(train_set)

    # classify a test sentence
    sentence = input("Please enter a sentence: ")
    sentence = delComa(sentence)
    print("\n")
    words = list(jieba.cut(sentence))
    print(words)

    # count how many words fall into each class
    ad = 0
    yellow = 0
    for word in words:
        classResult = classifier.classify(word_feats([word]))
        if classResult == 'ad':
            ad = ad + 1
        if classResult == 'ye':
            yellow = yellow + 1

    # proportion of words in each class
    total = max(len(words), 1)  # avoid division by zero on empty input
    x = ad / total
    y = yellow / total
    print('The possibility of advertising: %.2f%%' % (x * 100))
    print('The possibility of swearing: %.2f%%' % (y * 100))
```
Look at the results of a run.
All resources can be downloaded from: download.csdn.net/download/pe…
5、 Extensions
1、 The data source can be changed: the monitored data can be stored in a database and loaded from there.
2、 The data can be split into more categories to make it easier for customer service to handle, for example advertising, profanity, suggestions to the official team and so on, defined according to business needs.
3、 Messages with a high probability can be handled automatically by other systems, which speeds up problem handling.
4、 Player reports can be used to accumulate more data.
5、 The same idea can be used for handling sensitive words: provide a dictionary of sensitive words, then match against it to detect them.
6、 It can be turned into a web service that the game calls, with the result returned to the game through a callback.
7、 The model can learn while it predicts. For example, some cases need to be handled manually by customer service; once labelled, they are added directly to the data set, so the model keeps learning.
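The sensitive-word dictionary idea from item 5 could look roughly like the sketch below. The word list `SENSITIVE_WORDS` and the helper names `find_sensitive` and `mask` are hypothetical; the input is assumed to be a message that has already been segmented into words.

```python
# Hypothetical word list; in practice it would be loaded from a file or database.
SENSITIVE_WORDS = {"idiot", "trash", "recharge"}


def find_sensitive(words, dictionary=SENSITIVE_WORDS):
    """Return the words of an already segmented message that appear in the dictionary."""
    return [w for w in words if w.lower() in dictionary]


def mask(words, dictionary=SENSITIVE_WORDS):
    """Replace each sensitive word with asterisks of the same length."""
    return ["*" * len(w) if w.lower() in dictionary else w for w in words]


message = ["you", "are", "trash", "go", "recharge"]
print(find_sensitive(message))  # ['trash', 'recharge']
print(" ".join(mask(message)))  # you are ***** go ********
```

Exact dictionary matching is cheap and predictable, but it only catches words that are already on the list, which is why it works best alongside the classifier and not as a replacement for it.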
6、 Problems encountered
1、 Punctuation: if punctuation is not removed, punctuation marks also get matched as if they were words, which makes no sense.
2、 Encoding: the file was being read as binary, and it took a long time to sort out.
3、 Technology selection: at first I wanted to solve this with deep learning and looked at some solutions, but training on my computer is too slow, so I chose this approach first for practice.
4、 The code is simple, but explaining the technology is hard: the code was written quickly, yet this article took a whole weekend to write.
7、 Summary：
If you run into a problem, look for a technical solution; once you know the solution, implement it; if you hit a bug, track it down. Persistence always gets a response, and every attempt you make is a good opportunity to learn.
Author: [Coriander Chat Game]. Please include the original link when reprinting, thank you.