Python Complete Guide - Printing Data Using PySpark
2022-01-30 20:43:07, by Stone sword
Now let's learn how to print data with PySpark. Data is one of the most fundamental things today; it can exist in encrypted or decrypted form, and we generate enormous amounts of it every day, whether by tapping a button on a smartphone or browsing the web on a computer. But why are we making such a big deal of this?
The main problem researchers have faced over the past few years is: **how do we manage this much information?** Technology is the answer to that question. Apache Spark appeared, and PySpark was built on top of it to solve this problem.
If you are new to PySpark, an introductory PySpark tutorial is a good place to get started.
Introduction to PySpark
Apache Spark is a data processing engine that helps us build analytics solutions for huge software development projects.
It is also a tool of choice for big data engineers and data scientists, and knowledge of Spark is one of the skills technology companies look for when hiring.
It comes with many extensions and management options. One of them is PySpark, made for Python developers. It is one of the supported library APIs and can be installed explicitly on each computer, which makes it easy to manage and work with. And as we all know, installing libraries in Python is easy.
Prerequisites before printing data with PySpark
Before we start learning to print data with PySpark in different ways, there are some prerequisites we need to take care of:
- A core understanding of Python
- PySpark and its supporting packages
- Python 3.6 or above
- Java 1.8 or above (mandatory)
- An IDE such as Jupyter Notebook or VS Code
To check these, open a command prompt and run the following commands:
python --version
java -version
Version checking
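If PySpark itself is not installed yet, it can usually be added with pip. This is a minimal sketch assuming a standard pip setup; the package version you get may differ from the one used in this article.
pip install pyspark
Installing PySpark this way also pulls in py4j, which it needs to talk to the underlying JVM.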
You can print data with PySpark in several ways:
- Print the raw data
- Print formatted data
- Show the top 20-30 rows
- Show the bottom 20 rows
- Sort the data before displaying it
Resources and tools used in the rest of this tutorial:
- Dataset: titanic.csv
- Environment: Anaconda
- IDE: Jupyter Notebook
Create a session
In a Spark environment, the session is the record keeper for all the instances of our activity. To create it, we use the SQL module from the Spark library.
The SparkSession class has a builder attribute, which has an **appName()** function. This function takes the name of the application as a string argument.
Then we create the application with the **getOrCreate()** method, called using the dot '.' operator. With the code below, we create our application named 'App'.
We are completely free to give any name to the application we create, but don't forget to create the session, because we cannot continue without it.
Code:
import pyspark
from pyspark.sql import SparkSession
session = SparkSession.builder.appName('App').getOrCreate() # creating an app
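Once the session exists, it can be reused throughout the notebook. As a quick optional check (not part of the original walkthrough), the session object reports which Spark version it is running on:
session.version  # returns the Spark version string, which depends on your installation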
Different ways to print data with PySpark
Now that you are set up, let's get into the real work. We will now learn the different ways to print data with PySpark.
1. Print raw data
In this case, we will work with a raw dataset. In AI (artificial intelligence), a collection of data is called a dataset.
It comes in various forms, such as Excel files, comma-separated value files, text files, or server document models, so pay attention to which file format the raw data we print is stored in.
Here we are using a dataset with the **.csv** extension. The session's **read** attribute has various functions for reading files.
These functions are usually named after the file types they handle, so for our dataset we use the csv() function and store everything in the data variable.
Code:
data = session.read.csv('Datasets/titanic.csv')
data # calling the variable
By default, PySpark reads all the data as strings. So when we call our data variable, it returns every column typed as a string.
To print the raw data, call the **show()** function on the data variable using the dot operator '.':
data.show()
Reading the dataset
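To confirm that everything really was read as a string, we can optionally inspect the schema with printSchema(); this check is an addition here, not part of the original article:
data.printSchema()  # without the header and inferSchema options, columns are named _c0, _c1, ... and typed as string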
2. Format data
In PySpark, formatting the data means displaying the dataset's columns with the appropriate data types. To display all the headers, we use the option() function. This function takes two string arguments:
- key
- value
For the key argument we pass 'header', with the value 'true'. This makes Spark scan for a header row to display instead of the default column numbers.
The most important part is scanning the data type of each column. For that, we need to activate the inferSchema parameter of the csv() function we used earlier to read the dataset. It is a parameter of Boolean data type, so we set it to True to activate it. Each function is chained to the next with the dot operator.
Code:
data = session.read.option('header', 'true').csv('Datasets/titanic.csv', inferSchema=True)
data
data.show()
Output: we can see that the column headers and the appropriate data types are now visible.
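As an optional follow-up check (again, not from the original article), printSchema() now reports the inferred types; in the standard Titanic CSV, columns such as PassengerId come back as integers and Fare as a double:
data.printSchema()  # e.g. PassengerId: integer, Name: string, Fare: double in the standard Titanic dataset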
3. Displaying the top 20-30 rows
To display the top 20-30 rows, we need only a single line of code. The **show()** function does this for us. If the dataset is large, it shows the first 20 rows by default. But we can make it display as many rows as we like; just pass that number as an argument to **show()**.
data.show() # to display top 20 rows
Displaying the first 20 rows
data.show(30) # to display top 30 rows
Displaying the first 30 rows
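A small side note that the original text does not cover: **show()** also accepts a truncate argument, which controls whether long cell values are cut off at 20 characters:
data.show(5, truncate=False)  # print the first 5 rows without truncating long column values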
We can achieve something similar with the **head()** function. It specifically provides access to the top rows of the dataset, takes the number of rows as an argument, and returns them row by row. For example, to get the first 10 rows:
data.head(10)
However, the result comes back as an array or list of rows. The disappointing part is that head() is not practical for viewing large datasets with thousands of rows, as the output below shows.
Printing the first 10 rows with head()
4. Displaying the bottom 20-30 rows
This is also an easy task. The tail() function helps us accomplish it: call it on the DataFrame variable and pass the number of rows we want to display as the argument. For example, to show the last 20 rows, we write:
data.tail(20)
Displaying the bottom 20 rows
Likewise, this does not give a proper tabular view; because our dataset is so large, the returned rows are hard to read.
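One possible workaround, sketched here as an assumption rather than something from the original article, is to wrap the Row objects returned by tail() back into a DataFrame so that show() can format them:
last_rows = data.tail(20)  # a plain Python list of Row objects
session.createDataFrame(last_rows, schema=data.schema).show(5)  # pretty-print a few of them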
5. Sort the data before displaying
Sorting is the process of putting things in their proper order. It can be ascending, from smallest to largest, or descending, from largest to smallest. This plays an important role when viewing data points in order. The columns in a DataFrame can be of various types, but the two main ones are integers and strings:
- Integers are sorted by numeric value.
- Strings are sorted alphabetically.
PySpark's sort() function exists for exactly this purpose. It can take one or more columns as its arguments. Let's try it on our dataset by sorting the PassengerId column. For this we have two functions:
- sort()
- orderBy()
Sort in ascending order
data = data.sort('PassengerId')
data.show(5)
Sorting a single column
The PassengerId column is now sorted; the code arranges all the elements in ascending order. Here we sorted only a single column. To sort multiple columns, we pass them one by one to the sort() function, separating each column with a comma.
data = data.sort('Name', 'Fare')
data.show(5)
Sort multiple columns
Sort in descending order
This is where the **orderBy()** function comes in. It provides a special option for sorting our data in descending order.
In this case the code stays mostly the same, except that we reference the columns and, using the dot operator, call the **desc()** function on them inside **orderBy()**.
**desc()** arranges or sorts all the elements of those specific columns in descending order.
First, let's look at all the columns in the dataset.
Code:
data.columns
Listing the columns of the dataset
In the following code, we sort by the Name and Fare columns. Name is of string data type, so it is sorted alphabetically, while Fare is a number, so it is sorted from largest to smallest.
Code:
data = data.orderBy(data.Name.desc(), data.Fare.desc())
data.show(5)
Sort in descending order
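The same descending sort can also be written with the desc() helper from pyspark.sql.functions; this is an equivalent alternative, not the approach used in the article itself:
from pyspark.sql.functions import desc
data.orderBy(desc('Name'), desc('Fare')).show(5)  # identical result, expressed with the desc() column helper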
Summary
So that is how we print the contents of data with PySpark. Each piece of code is short and easy to understand, and it is enough to build a working knowledge of Spark's functions. This environment is very powerful for big data and other industrial and technological fields.
Copyright notice
Author: Stone sword. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201302043047062.html