Python Complete Guide - Printing Data Using PySpark
2022-01-30 20:43:07 [Stone sword]
Let's learn how to print data using PySpark. Data is one of the most essential things available today, and it can come in encrypted or decrypted form. In fact, we tend to create a huge amount of information every day, whether by tapping a button on a smartphone or browsing the web on a computer. But why do we talk about this so much?
The main problem researchers have faced in recent years is: **how do we manage this much information?** Technology is the answer. Apache Spark appeared, and PySpark was built on top of it to address this problem.
If you are new to PySpark, here is a PySpark tutorial to get you started.
Introduction to PySpark
Apache Spark is a data management engine that helps us devise analytics solutions for huge software development projects.
It is also a tool of choice for big data engineers and data scientists. Knowledge of Spark is one of the skills technology companies look for when recruiting.
It comes with many extensions and management options. One of them is PySpark, which is made for Python developers. It is one of the APIs with supporting libraries that can be explicitly installed on each computer, which makes it easy to manage and deploy. As we all know, installing libraries in Python is straightforward.
Before we print data with PySpark
Before we start learning the different ways to print data with PySpark, **there are some prerequisites we need to consider:**
- A core understanding of Python
- PySpark and its supporting packages
- Python 3.6 and above
- Java 1.8 and above (mandatory)
- An IDE, such as Jupyter Notebook or VS Code
To check these, open a command prompt and enter the following commands:
```shell
python --version
java -version
```
You can print data with PySpark in the following ways:
- Print the raw data
- Format the printed data
- Display the top 20-30 rows
- Display the bottom 20 rows
- Sort the data before displaying it
Resources and tools used in the rest of this tutorial:
- Dataset: titanic.csv
- Environment: Anaconda
- IDE: Jupyter Notebook
Creating a session
In the Spark environment, a session holds a record of all the instances of our activity. To create one, we use the SQL module from the Spark library.
This SparkSession class has a builder attribute, which has an **appName()** function. This function takes the name of the application as a string argument.
Then we create the app with the **getOrCreate()** method, called using the dot **'.'** operator. With this code, we create our application named "App".
We are free to give any name to the application we create. But don't forget to create a session, because we cannot proceed without one.
```python
import pyspark
from pyspark.sql import SparkSession

session = SparkSession.builder.appName('App').getOrCreate()  # creating an app
```
Different ways to print data with PySpark
Now that everything is ready, let's get down to the real deal and learn the different ways of printing data with PySpark.
1. Print the raw data
In this example, we will work with a raw dataset. In the AI (artificial intelligence) domain, we call a collection of data a dataset.
It comes in various forms, such as Excel files, comma-separated value files, text files, or server document models. So, take note of which file format we are using to print the raw data.
Here, we are using a dataset with a **.csv** extension. The session's **read** attribute has various functions for reading files.
These functions are usually named after the different file types, so we use the csv() function for our dataset and store everything in the data variable.
```python
data = session.read.csv('Datasets/titanic.csv')
data  # calling the variable
```
By default, PySpark reads all the data as strings, so when we call our data variable, it returns every column as a string.
To print the raw data, call the **show()** function on the data variable using the dot operator '.'.
```python
data.show()
```
Reading the dataset
2. Format the data
Formatting data in PySpark means displaying the appropriate data types of the columns in the dataset. To display all the headers, we use the option() function, which takes two string arguments:
- the key
- the value
For the key argument, we give the value header, whose value is true. The point of this is that it scans the headers to display instead of the column numbers on top.
The most important thing is to scan the data type of each column. For that, we need to activate the inferSchema parameter of the csv() function we used earlier to read the dataset. It is a parameter of Boolean data type, which means we must set it to True to activate it. We connect each function with the dot operator.
```python
data = session.read.option('header', 'true').csv('Datasets/titanic.csv', inferSchema=True)
data  # calling the variable
```

```python
data.show()
```
Displaying the data in the correct format
As we can see, the headers and the appropriate data types are now visible.
3. Display the top 20-30 rows
To display the top 20-30 rows, we need only one line of code. The **show()** function does this for us. If the dataset is too large, it shows the top 20 rows by default. However, we can make it display as many rows as we like; just pass that number as a parameter to the **show()** function.
```python
data.show()  # to display top 20 rows
```
Displaying the top 20 rows
```python
data.show(30)  # to display top 30 rows
```
Displaying the top 30 rows
We can achieve the same thing with the **head()** function, which specifically provides access to the rows at the top of the dataset. It takes the number of rows as a parameter and displays them. For example, to display the top 10 rows:
```python
data.head(10)
```
However, the result comes back as an array or list. The disappointing part is that we cannot really use head() on large datasets with thousands of rows, because the output becomes unreadable.
Printing the top 10 rows with the head() method
4. Display the bottom 20-30 rows
This is also an easy task. The tail() function helps us here: call it on the data frame variable and pass the number of rows we want to display as a parameter. For example, to display the last 20 rows, we write:
```python
data.tail(20)
```
Displaying the bottom 20 rows
Likewise, we cannot get a proper view of this either, because our dataset is too large for these rows to display nicely.
5. Sort the data before displaying it
Sorting is the process of placing things in their proper order. This can be ascending (small to large) or descending (large to small), and it plays an important role when viewing data points in order. The columns in a data frame can be of various types, but the two main types are integer and string.
- Integers are sorted by numeric value.
- Strings are sorted alphabetically.
The sort() function in PySpark is used for exactly this purpose. It can take a single column or multiple columns as its parameters. Let's try it on our dataset: we will sort the PassengerId column. For this, we have two functions.
Sorting in ascending order
```python
data = data.sort('PassengerId')
data.show(5)
```
Sorting a single column
The PassengerId column has been sorted; the code arranges all the elements in ascending order. Here we sorted only a single column. To sort multiple columns, we can pass them to the sort() function one by one, separating each column with a comma.
```python
data = data.sort('Name', 'Fare')
data.show(5)
```
Sorting multiple columns
Sorting in descending order
This is what the **orderBy()** function is for. It provides a special option to sort our data in descending order.
In this case, all the code stays the same; we simply reference the columns, and after connecting to them with the dot operator, call the **desc()** function inside orderBy().
**desc()** arranges or sorts all the elements of those specific columns in descending order.
First, let's look at all the columns in the dataset.
```python
data.columns
```
Listing the columns
In the following code, we sort on the Name and Fare columns. Name is of string data type, so it is sorted alphabetically, while Fare is a number, so it is sorted in a large-to-small pattern.
```python
data = data.orderBy(data.Name.desc(), data.Fare.desc())
data.show(5)
```
Sorted in descending order
So, this is how we can print data with PySpark. Every piece of code is short and easy to understand, and this is enough to get acquainted with the code side of Spark's functions. Its environment is very powerful for big data and other industrial and technological fields.
Author: [Stone sword]. Please include a link to the original when reprinting. Thank you.