current position:Home>Three basic data quality evaluation methods and python implementation
Three basic data quality evaluation methods and python implementation
2022-05-15 05:51:38【Broad starfish sink】
Basic data quality evaluation method
1、 Saturation of statistical table data ( Null rate ) analysis
2、 Through code table comparison analysis, threshold accuracy analysis ( Correlation between real-time table and dimension table )
3、 Time saturation ( Data continuity ) analysis
4、 Relationship between subject table and behavior table ( No examples yet , Design according to the specific scene )
Null rate of statistical table data
# brief introduction : Generate the field saturation of a table ( Null rate ) SQL Script , You can choose to run manually or automatically
# author : Wang Zhendong
# date :2021-02-07
from odps import ODPS
o = ODPS('ak', 'sk', 'project_namme', endpoint='http://xxxxx/api')
def check_data_by_execute_sql(table_name, partition):
ta = o.get_table(table_name)
sql_str = 'select \n'
for col in ta.schema.columns:
col_name = col.name
col_comm = col.comment
if col_comm == 'null' or col_comm is None or col_comm == '':
continue
sql_str += "sum(case when (%s is null) or (%s in ('', 'null', 'NULL', '-')) or " \
"(trim(%s) = '') then 1 else 0 end)/count(1) as `%s`,\n" % \
(col_name, col_name, col_name, col_comm)
sql_str += "count(1) as total_cnt \nfrom %s where %s" %(table_name, partition)
# print(sql_str)
print('| Field name | Null rate |\n|----|----|\n')
statistic = {'0.0': 0, '0.1': 0, '0.2': 0, '0.3': 0, '0.4': 0,
'0.5': 0, '0.6': 0, '0.7': 0, '0.8': 0, '0.9': 0, '1.0': 0}
with o.execute_sql(sql_str).open_reader() as rec:
rf = rec.to_result_frame()
n = rf.names
v = rf.values[0]
for i in range(len(n) - 1):
print("|%s|%.2f|" % (n[i], v[i]))
statistic['%.1f' % v[i]] = statistic['%.1f' % v[i]] + 1
print(' Total data ', v[-1])
for i in statistic.keys():
print(i, statistic[i])
def main():
check_data_by_execute_sql('table_name', "partition_name")
check_data_by_execute_sql('table_name', "partition_name")
check_data_by_execute_sql('table_name', "partition_name")
if __name__ == '__main__':
main()
The threshold accuracy is analyzed by comparing the code table
# brief introduction : spell SQL Screening code table comparison
# author : Wang Zhendong
# date :2021-01-07
"""
Import the dictionary of a field into the standard dictionary table template , Dictionary table structure :
create table dim_gjlcb2d0_zdbm_all (
code string,
name string)
comment ' Diagnostic code table ';
create table dim_gjlcb2d0_ssbm_all (
code string,
name string)
comment ' Operation code table ';
Statistical content : Code table comparison failed name/code Yes , And the corresponding number of data pieces
"""
"""
var:
code_field Code fields
name_field Name field
code_ch Code field comments
name_ch Name field comment
table_name The name of the table
dim_table Dictionary table name
partition Specify the data partition
"""
def fill_sql(var):
code_field, name_field, code_ch, name_ch, table_name, dim_table, partition = var
sql_model = """
-- Abnormal Statistics
select '%s_%s',count(1) Incomparable name_code logarithm ,sum(cnt) Number of data involved
from (select code, name, cnt
from (select %s code, %s name,count(1) cnt from %s
where %s and %s is not null and %s is not null and %s !='-' and %s != '-' group by %s,%s) t_1 -- Filter out the null value
left anti join %s t_2
on t_1.code = t_2.code and t_1.name = t_2.name) t_3;
"""
detail_sql_model = """
-- Exception details
select code %s, name %s, cnt Data volume
from (select %s code, %s name,count(1) cnt from %s
where %s and %s is not null and %s is not null and %s !='-' and %s != '-' group by %s,%s) t_1 -- Filter out the null value
left anti join %s t_2
on t_1.code = t_2.code and t_1.name = t_2.name;
"""
# code_ch, name_ch
return sql_model % (code_ch, name_ch, code_field, name_field, table_name, partition, code_field, name_field,
code_field, name_field, code_field, name_field, dim_table)
def main():
var_list = [('c', 'n', ' Diagnostic code ', ' Diagnosis name ', 'xxxxxxxxx',
'dim_gjlcb2d0_zdbm_all', "partition'")]
for v in var_list:
print(fill_sql(v))
if __name__ == '__main__':
main()
Time saturation
-- Be careful : If there is dirty data in the table ( For example, the date is 9999-01-01 00:00:00 Time for , Need to increase the where Conditions filter out )
-- View the time interval of the data
select
max(time_field),min(time_field)
from table_name
where pt = 'partition_name';
-- Duration of Statistics ( Day level )
select
datediff(max(time_field),min(time_field),'dd') The total number of days the data lasts
from table_name
where pt = 'partition_name'
and time_field between ' constraint condition ' and ' constraint condition ';
-- The day level time saturation of the organization dimension : organization , Number of days with data , Time saturation , Total data
select org_code organization , count(1) Number of days with data , count(1)/ The total number of days the data lasts Time saturation , sum(cnt) Total data
from (
select
org_code, to_char(time_field, 'yyyy-mm-dd') dd, count(1) cnt
from table_name
where pt = 'partition_name'
and time_field between ' constraint condition ' and ' constraint condition '
group by org_code, to_char(time_field, 'yyyy-mm'))
where cnt > 0
group by org_code;
copyright notice
author[Broad starfish sink],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/131/202205110610439334.html
The sidebar is recommended
- Python development alert notification SMS alert
- How to configure Python environment library offline in FME
- Python: fastapi - beginner interface development
- Generate password based on fast token and fast token
- [Django CI system] use of json-20220509
- [Django CI system] if the front-end date is complete, it will be fully updated to the back-end; If the front-end date is incomplete, the date will not be updated to the back-end-20220510
- [Django CI system] echarts dataset standard writing - 20220509
- [Django CI system] obtain the current time, the first day and the last day of the month, etc. - 20220510
- wxPython wx. Correction of font class · Wx Font tutorial
- NCT youth programming proficiency level test python programming level 3 - simulation volume 2 (with answers)
guess what you like
Design of personal simple blog system based on Django (with source code acquisition method)
[Python Script] classify pictures according to their definition
Wu Enda's classic ml class is fully upgraded! Update to Python implementation and add more intuitive visual teaching
Six built-in functions called immortals in Python
Some insights of pandas in machine learning
Introduction to Python [preliminary knowledge] - programming idea
Stay up late to tidy up! Pandas text processing Encyclopedia
Python recursion to find values by dichotomy
Open 3D Python Interface
[true title 02 of Blue Bridge Cup] Python output natural number youth group analysis of true title of Blue Bridge Cup Python national competition
Random recommended
- Introduction to the differences between Python and Java
- Explain Python CONDA in detail
- The pycham downloaded by MAC reports an error as soon as it is opened. The downloaded Python interpreter is also the latest version
- From entry to mastery, python full stack engineers have personally taught Python core technology and practical combat for ten years
- Python is used to detect some problems of word frequency in English text.
- How to choose between excel, database and pandas (Python third-party library)?
- WxPython download has been reporting errors
- Pyside6 UIC and other tools cannot be found in the higher version of pyside6 (QT for Python 6). How to solve it?
- About Python Crawlers
- Successfully imported pandas, unable to use dataframe
- How to extract some keywords in the path with Python
- Python encountered a problem reading the file!
- When Python is packaged into exe, an error is reported when opening assertionerror: C: \ users \ Acer \ appdata \ local \ temp\_ MEI105682\distutils\core. pyc
- Eight practical "no code" features of Python
- Python meets SQL, so a useful Python third-party library appears
- 100 Python algorithm super detailed explanation: a hundred dollars and a hundred chickens
- [fundamentals of Python] Python code and so on
- When Python uses probit regression, the program statement is deleted by mistake, and then it appears_ raise_ linalgerror_ Unrecognized error of singular
- Python testing Nicholas theorem
- Accelerating parallel computing based on python (BL) 136
- Python dynamic programming (knapsack problem and longest common substring)
- Django uses queryset filter save, and an 'queryset' object has no attribute 'Save' error occurs. Solution?
- Analysis of built-in functions in Python learning
- Python office automation - 90 - file automation management - cleaning up duplicate files and batch modifying file names
- Python office automation - 91 - word file Automation - word operation and reading word files
- After python, go also runs smoothly on the browser
- Self taught Python 26 method
- Summary of Python Tkinter component function examples (code + effect picture) (RadioButton | button | entry | menu | text)
- Python implementation of official selection sorting of Luogu question list
- Application of Django template
- Get project root path and other paths in Python project
- Get, rename, and delete file names in Python projects
- How to set the width and height of Python operation table
- Python string preceded by 'f' R 'B' U '
- JSON and other types convert to each other in Python
- Key value of key combination in pynput in Python
- Conversion of Python PDF file to word file
- Interface testing uses Python decorators
- Get the current time in Python
- Python course notes -- Python string, detailed explanation of related functions