# "Python instance" was shocked and realized the dirty words and advertisement detection of the chat system with Python

2022-01-30 06:25:44


In games, a chat feature is almost mandatory, but it brings a problem: the world channel can get very chaotic, full of sensitive words or messages that the game publisher would rather not see. We ran into exactly this in our own game; our company handled it with player reports and back-office monitoring. Today, let's implement that kind of monitoring ourselves.

## 1. Requirements analysis

I am not well versed in deep learning. I have written about reinforcement learning before, but its results were not particularly satisfying, so here I explore a simpler approach.

There are ready-made solutions for this kind of classification task; spam filtering, for example, is essentially the same problem. Among the available approaches I chose the simplest: naive Bayes classification. This is mainly an exploration.

Because most of our games are in Chinese, we need Chinese word segmentation, i.e. breaking a sentence like "I'm a handsome guy" into its component words.

## 2. Algorithm principles

The naive Bayes algorithm judges the category of a new sample from the conditional probabilities of its features in the existing dataset. It makes two assumptions: (1) the features are mutually independent, and (2) every feature is equally important. You can also think of it as using past frequencies to judge the probability of a category given that the current features all hold at once. The detailed math is easy to look up, so I will only sketch it here.
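In symbols, the sketch is short: for a message with words $w_1, \dots, w_n$, naive Bayes picks the class $c$ that maximizes the class prior times the per-word conditional probabilities (the product form is exactly the independence assumption):

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

Both $P(c)$ and $P(w_i \mid c)$ are estimated by counting over the training corpus, which is why the training step later is so cheap.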

Use the right algorithm at the right time .

The principle of jieba segmentation: jieba is a probabilistic language model. Its segmentation task is to find, among all possible segmentations of a sentence, the segmentation scheme S that maximizes P(S).

You can see that jieba ships with a built-in dictionary of phrases; during segmentation, these phrases are treated as the base units of the split.

Note: this is only a brief introduction to the principles of the two techniques above. Explaining them fully would take another long article, and there is plenty of material online if you want to dig deeper. If you can use a tool, start by using it.

## 3. Technical analysis

The best-known Chinese word segmentation package is jieba. Whether it is the best, I cannot say, but popularity usually has its reasons, so let's start with it. I will not dig into jieba's internals; solving the problem comes first, and studying the details only when you actually hit a problem is the most efficient way to learn.

Since I have been doing speech-related work recently, a colleague recommended the nltk library. From the material I found, it is a well-known and very powerful library for natural language processing. Here I mainly use its classification algorithms, so I do not have to implement them myself, reinvent the wheel, and probably do a worse job than the library does.
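A minimal sketch of the nltk classifier interface used later in this article (the tiny training set here is invented purely for illustration): `NaiveBayesClassifier.train` takes a list of `(feature-dict, label)` pairs, and `classify` returns a label for a new feature dict.

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # nltk expects a dict of feature-name -> value per sample
    return {word: True for word in words}

# Toy training data, made up for illustration: 'ad' vs 'normal' chat lines.
train_set = [
    (word_feats(["cheap", "gold", "visit", "our", "site"]), "ad"),
    (word_feats(["buy", "gold", "now", "discount"]), "ad"),
    (word_feats(["nice", "fight", "well", "played"]), "normal"),
    (word_feats(["see", "you", "at", "the", "raid"]), "normal"),
]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_feats(["cheap", "gold", "discount"])))  # ad
```

That is the whole API surface the project needs: build feature dicts, train, classify.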

Python really is pleasant: there is a package, and a wheel, for everything.

Installation commands:

```shell
pip install jieba
pip install nltk
```

Run the two commands above; once they finish, the packages are installed and you can start experimenting happily.

``````"""
#Author:  Coriander
@time: 2021/8/5 0005  Afternoon  10:26
"""
import jieba

if __name__ == '__main__':
result = " | ".join(jieba.cut(" I love tian 'anmen square in Beijing ,very happy"))
print(result)
Copy code ``````

Look at the segmentation result: it is very good. A dedicated tool really does its job well.

## 4. Source code

With those simple tests done, we have basically everything we need, so let's get straight to the code. The steps are:

1. Load the initial text resources.

2. Strip punctuation marks from the text.

3. Extract features from the text.

4. Train on the dataset to produce a model (i.e. the prediction model).

5. Test new sentences against the model.

```python
#!/usr/bin/env python
# encoding: utf-8
"""
#Author: Coriander
@time: 2021/8/5 9:29 PM
"""
import re

import jieba
from nltk.classify import NaiveBayesClassifier

# Keep only letters, digits and Chinese characters; everything else is stripped
rule = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5]")


def delComa(text):
    """Remove punctuation and other non-word characters."""
    return rule.sub('', text)


def wordSplit(text1):
    """Strip punctuation, then segment the text with jieba."""
    text1 = delComa(text1)
    list1 = jieba.cut(text1)
    return " ".join(list1)


def word_feats(words):
    """Feature extraction: each word becomes a boolean feature."""
    return dict([(word, True) for word in words])


if __name__ == '__main__':
    # Load the labeled corpora, one chat line per line. The loading code was
    # omitted from the original article; the file names here are placeholders
    # for your own data files (advertising lines and dirty-word lines).
    with open('ad.txt', encoding='utf-8') as f:
        adResult = [wordSplit(line).split(" ") for line in f if line.strip()]
    with open('yellow.txt', encoding='utf-8') as f:
        yellowResult = [wordSplit(line).split(" ") for line in f if line.strip()]

    ad_features = [(word_feats(df), 'ad') for df in adResult]
    yellow_features = [(word_feats(df), 'ye') for df in yellowResult]
    train_set = ad_features + yellow_features

    # Train the classifier
    classifier = NaiveBayesClassifier.train(train_set)

    # Analyze a test sentence
    sentence = input("Please enter a sentence: ")
    sentence = delComa(sentence)
    print("\n")
    seg_list = jieba.cut(sentence)
    result1 = " ".join(seg_list)
    words = result1.split(" ")
    print(words)

    # Count how many words fall into each category
    ad = 0
    yellow = 0
    for word in words:
        # Wrap the word in a list so the feature is the word itself,
        # not its individual characters
        classResult = classifier.classify(word_feats([word]))
        if classResult == 'ad':
            ad = ad + 1
        if classResult == 'ye':
            yellow = yellow + 1

    # Proportions
    x = float(ad) / len(words)
    y = float(yellow) / len(words)
    print('The possibility of advertising: %.2f%%' % (x * 100))
    print('The possibility of swearing: %.2f%%' % (y * 100))
```

Run it and take a look at the results.

## 5. Extensions

1. The data source can be changed; the monitored chat data can be stored in a database and loaded from there.

2. More categories can be added to make handling easier for customer service, e.g. advertising, dirty language, suggestions to the operators, and so on, defined by business needs.

3. Messages with a high probability can be handled automatically by other systems, speeding up problem handling.

4. Player reports can be used to grow the accumulated dataset.

5. The same idea can serve as sensitive-word handling: provide a dictionary of sensitive words, then match and detect against it.

6. It can be wrapped as a web service for the game to call.

7. The model can learn while predicting: cases that customer service must handle manually are labeled and added straight to the dataset, so the model keeps learning over time.
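Idea 5 above needs no machine learning at all: keep a set of banned words and check each segmented token against it. A minimal sketch (the banned-word list here is a made-up placeholder; a real one would be loaded from a file or database):

```python
# Dictionary-based sensitive-word check: no model, just set membership.
# The banned-word list below is a placeholder for illustration only.
BANNED = {"scam", "cheat", "goldseller"}

def contains_sensitive(tokens):
    """Return the banned words found among the segmented tokens."""
    return [t for t in tokens if t.lower() in BANNED]

hits = contains_sensitive(["buy", "from", "goldseller", "now"])
print(hits)  # ['goldseller']
```

In practice this runs before (or alongside) the classifier, since an exact dictionary hit is cheaper and more certain than a probability estimate.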

## 6. Problems encountered

1. Punctuation: if punctuation is not removed, it gets matched as if it were a word, which is unreasonable.

2. Encoding: the files were being read as binary at first, and it took a long time to sort out.

3. Technology selection: I originally wanted to solve this with deep learning, and I did see some solutions, but training on my machine was too slow, so I chose this approach to practice with first.

4. The code is simple, but the techniques are hard to explain; the code was finished long ago, yet writing this article still took a whole weekend.

## 7. Summary

When you hit a problem, look for the technical solution; once you know the plan, implement it; when you meet a bug, go debug it. Keep at it and the effort will echo back: every attempt you make is a good opportunity to learn.