
"Python in practice": detecting profanity and ad spam in a chat system with Python

2022-01-30 06:25:44 Coriander Chat Game


In games, chat is almost a mandatory feature, but it brings a problem: the world channel can become very chaotic, full of sensitive words or messages the game publisher doesn't want to see. We ran into this in a game we shipped; our company built player reporting and back-office monitoring for it, and today let's implement that kind of monitoring ourselves.

1、 Requirements analysis

I'm not strong at deep learning. I've written about reinforcement learning before, but its results weren't particularly satisfying, so here I'll study a simpler approach.

There are ready-made solutions for this kind of classification task; spam filtering, for example, is the same problem. Among the available approaches I chose the simplest, naive Bayes classification, mainly as an exploration.

Because most of our game text is Chinese, we also need Chinese word segmentation: for example, splitting the sentence "I am a handsome guy" into individual words.

2、 Algorithm principles

The naive Bayes algorithm judges the category of a new sample from the conditional probabilities of that sample's features in the training set. It assumes that (1) the features are mutually independent and (2) every feature is equally important. You can also read it as: based on past probabilities, estimate the probability of each class when all the current features hold at once. You can look up the exact math yourself; a rough understanding is enough here.
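For reference, the formula being skipped is just Bayes' rule plus the independence assumption. Writing the word features of a message as w_1, …, w_n and a class (e.g. "ad" or "profanity") as c:

```latex
% Posterior for class c given the words of the message;
% the independence assumption lets the likelihood factor into a product.
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)

% Classification picks the class with the largest posterior:
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```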

Use the right algorithm at the right time .

How jieba segmentation works: jieba is based on a probabilistic language model. The segmentation task is: among all possible segmentations of the text, find the scheme S that maximizes P(S).


You can see that jieba ships with a built-in phrase dictionary; during segmentation these phrases are treated as the basic units of the split.
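To make the "find S maximizing P(S)" idea concrete, here is a toy dynamic-programming segmenter. This is only an illustration of the principle, not jieba's actual implementation, and the mini-dictionary probabilities below are made-up placeholder values:

```python
import math

# Hypothetical unigram probabilities for a handful of words/characters.
# jieba's real dictionary is far larger, but works in the same spirit.
WORD_PROB = {
    "我": 0.05, "爱": 0.03, "北京": 0.04, "天安门": 0.02,
    "北": 0.01, "京": 0.01, "天": 0.01, "安": 0.01, "门": 0.01,
}

def best_segmentation(text):
    """Dynamic programming: find the split S maximizing P(S) = prod P(w_i)."""
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i], that split)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # cap candidate words at 4 chars
            word = text[j:i]
            p = WORD_PROB.get(word)
            if p is None:
                continue
            score = best[j][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return best[n][1]

if __name__ == "__main__":
    # "北京" and "天安门" beat their character-by-character splits
    print(" | ".join(best_segmentation("我爱北京天安门")))
```

Because log P("北京") is larger than log P("北") + log P("京"), the phrase wins over the character-level split, which is exactly why the built-in phrases come out as base units.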

Note: the above is only a brief introduction to these two techniques. Explaining them fully would take another long article; plenty of readable material is a search away. If it works, use it first.

3、 Technical analysis

For Chinese word segmentation, the best-known package is jieba. Whether it is the best I don't know, but popularity usually has its reasons, so let's start with it. I won't dig into jieba's internals: solve the problem first, and study the specific pain points when you actually hit them; that learning model is the most efficient.

Since I've been working on speech-related things recently, a colleague recommended the nltk library. Looking up the documentation, it seems to be a well-known and very powerful library for language processing. Here I mainly use its classification algorithms, so I don't have to write the implementation myself; there's no need to reinvent the wheel, especially when mine would be worse.

Python really is nice: there's a package, a wheel, for everything.

Installation command :

pip install jieba
pip install nltk

Run the two commands above; once they finish, the packages are installed and you can test happily.

#!/usr/bin/env python
# encoding: utf-8
# Author: Coriander
# @time: 2021/8/5 10:26 PM
import jieba

if __name__ == '__main__':
    result = " | ".join(jieba.cut("我爱北京天安门,very happy"))
    print(result)

Look at the segmentation result: it's very good. A dedicated library really does its job well.


4、 Source code

The simple tests show we basically have everything we need, so let's get straight to the code. The steps:

1、 Load the initial text corpus

2、 Strip punctuation from the text

3、 Extract features from the text

4、 Train on the data set to produce the prediction model

5、 Test new sentences

#!/usr/bin/env python
# encoding: utf-8
# Author: Coriander
# @time: 2021/8/5 9:29 PM
import re

import jieba
from nltk.classify import NaiveBayesClassifier

# Keep only letters, digits and CJK characters; strip everything else
rule = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5]")

def delComa(text):
    return rule.sub('', text)

def loadData(fileName):
    with open(fileName, "r", encoding='utf-8') as f:
        text1 = f.read()
    text1 = delComa(text1)
    return " ".join(jieba.cut(text1))

# Feature extraction: each character of a word becomes a boolean feature
def word_feats(words):
    return dict([(word, True) for word in words])

if __name__ == '__main__':
    adResult = loadData(r"ad.txt")
    yellowResult = loadData(r"yellow.txt")
    # Split on the spaces inserted by loadData so training is per word,
    # matching how single words are classified below
    ad_features = [(word_feats(lb), 'ad') for lb in adResult.split(" ")]
    yellow_features = [(word_feats(df), 'ye') for df in yellowResult.split(" ")]
    train_set = ad_features + yellow_features
    # Train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    # Test a new sentence
    sentence = input("Please enter a sentence:")
    sentence = delComa(sentence)
    words = " ".join(jieba.cut(sentence)).split(" ")
    # Count the per-class hits
    ad = 0
    yellow = 0
    for word in words:
        classResult = classifier.classify(word_feats(word))
        if classResult == 'ad':
            ad = ad + 1
        if classResult == 'ye':
            yellow = yellow + 1
    # Proportions
    x = float(ad) / len(words)
    y = float(yellow) / len(words)
    print('Probability of advertising: %.2f%%' % (x * 100))
    print('Probability of profanity: %.2f%%' % (y * 100))
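If you want to see the classify-and-count idea without installing nltk, here is a dependency-free sketch of the same scheme. The tiny word lists are made-up placeholders, not real training data, and the smoothed frequency score stands in for the nltk classifier:

```python
from collections import Counter

# Hypothetical toy training data (placeholders, not real corpora)
AD_WORDS = ["cheap", "gold", "discount", "buy", "gold"]
DIRTY_WORDS = ["idiot", "stupid", "fool"]

def train(ad_words, dirty_words):
    # Per-class word counts play the role of the trained model
    return {"ad": Counter(ad_words), "ye": Counter(dirty_words)}

def classify_word(model, word):
    # Pick the class where the word is relatively most frequent; add-one
    # smoothing keeps unseen words from zeroing out a class
    def score(cls):
        counts = model[cls]
        return (counts[word] + 1) / (sum(counts.values()) + len(counts))
    return max(model, key=score)

def proportions(model, words):
    # Classify every word, then report the share of each class,
    # mirroring the ad/yellow counters in the main program
    hits = Counter(classify_word(model, w) for w in words)
    return {cls: hits[cls] / len(words) for cls in model}

if __name__ == "__main__":
    model = train(AD_WORDS, DIRTY_WORDS)
    print(proportions(model, ["buy", "cheap", "gold", "stupid"]))
    # → {'ad': 0.75, 'ye': 0.25}
```

The per-word vote and final proportion are exactly what the x and y percentages in the main program compute.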

Here are the results of a run:


Download address of all resources…

5、 Extensions

1、 The data source can be changed: the monitored data can be stored in a database and loaded from there

2、 More categories can be added to make things easier for customer service, e.g. advertising, profanity, suggestions to staff, etc., defined by business requirements

3、 High-probability hits can be handled automatically by other systems, speeding up problem handling

4、 Player reports can be used to grow the data set

5、 The same idea works for sensitive-word handling: provide a dictionary of sensitive words, then match and detect

6、 It can be packaged as a web service that the game calls back into

7、 The model can learn while it predicts: cases that customer service handles manually get labeled and added directly to the data set, so the model keeps learning
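For extension idea 5, a minimal sketch of dictionary-based sensitive-word matching. The word list here is a made-up placeholder; in practice it would come from the dictionary mentioned above:

```python
import re

# Hypothetical sensitive-word dictionary (placeholder entries)
SENSITIVE = ["advert", "scam", "badword"]

# Build one alternation pattern; re.escape keeps special characters literal
PATTERN = re.compile("|".join(map(re.escape, SENSITIVE)))

def find_sensitive(text):
    """Return every sensitive word found in the text, in order."""
    return PATTERN.findall(text)

def mask_sensitive(text):
    """Replace each sensitive word with asterisks of the same length."""
    return PATTERN.sub(lambda m: "*" * len(m.group()), text)

if __name__ == "__main__":
    print(find_sensitive("this scam is an advert"))  # → ['scam', 'advert']
    print(mask_sensitive("this scam is an advert"))  # → this **** is an ******
```

This complements the classifier: the dictionary catches known bad words exactly, while naive Bayes catches messages that merely look like ads or abuse.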

6、 Problems encountered

1、 Punctuation: if punctuation is not removed, it takes part in the matching, which makes no sense

2、 Encoding: the file was initially read as binary, which took a long time to sort out

3、 Technology selection: at first I wanted to solve this with deep learning, and I looked at some solutions, but my machine trains far too slowly, so I chose this approach to practice first

4、 The code is simple, but the techniques are hard to explain; the code was finished long ago, yet this article took a whole weekend to write

7、 Summary

When you hit a problem, look for a technical solution; once you know the plan, implement it; when you hit a bug, go investigate it. Keep at it and it will pay off; every attempt you make is a good opportunity to learn.

Copyright notice
Author: Coriander Chat Game. Please include the original link when reprinting. Thank you.
