Python Complete Guide - printing data using pyspark

2022-01-30 20:43:07 Stone sword

Now let's learn how to print data using PySpark. Data is one of the most basic resources today. It can be provided in encrypted or decrypted form. In fact, we all generate a lot of information every day, whether by tapping a button on a smartphone or browsing the web on a computer. But why do we talk so much about this?

The main problem researchers have faced over the past few years is **how to manage so much information**. Technology is the answer to this question: Apache Spark appeared, and PySpark was built on top of it to address the problem.

If you are new to PySpark, a PySpark tutorial is a good way to get started.

Introduction to PySpark

Apache Spark is a data processing engine that helps us build analytics solutions for large software development projects.

It is also a tool of choice for big data engineers and data scientists. Knowledge of Spark is one of the skills technology companies look for when recruiting.

It comes with many extension and management options. One of them is PySpark, which is prepared for Python developers. It is one of the APIs Spark supports as a library, and it can be installed explicitly on each computer, which makes it easy to manage and deploy. As we all know, installing libraries in Python is easy.

Before printing data with PySpark

Before we start learning the different ways to print data with PySpark, **there are some prerequisites we need to consider:**

  1. A core understanding of Python
  2. A core understanding of PySpark and its supporting packages
  3. Python 3.6 or above
  4. Java 1.8 or above (mandatory)
  5. An IDE, such as Jupyter Notebook or VS Code

To check these, open a command prompt and run the following commands.

python --version 

java -version

Version checking
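The same checks can also be scripted. The following is a minimal sketch in plain Python, assuming `java` is expected to be on the PATH; the helper name `check_prerequisites` is hypothetical and not part of PySpark.

```python
import shutil
import subprocess
import sys


def check_prerequisites(min_python=(3, 6)):
    """Return a small report on the Python and Java installations."""
    report = {
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "python_ok": sys.version_info[:2] >= min_python,
        "java_path": shutil.which("java"),  # None when java is not on the PATH
        "java_version": None,
    }
    if report["java_path"]:
        # `java -version` writes its banner to stderr rather than stdout.
        result = subprocess.run(
            [report["java_path"], "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        lines = (result.stderr or result.stdout).splitlines()
        report["java_version"] = lines[0] if lines else None
    return report


print(check_prerequisites())
```

If `java_path` comes back as `None`, Java is either not installed or not on the PATH, and PySpark will not be able to start.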

PySpark can print data in several ways:

  • Print raw data
  • Format the printed data
  • Show the top 20-30 rows
  • Show the bottom 20 rows
  • Sort the data before displaying it

Resources and tools used in the rest of this tutorial .

Create a session

In a Spark environment, the session is the record holder for all instances of our activity. To create it, we use the SQL module of the Spark library.

The SparkSession class has a builder attribute, which provides an **appName()** function. This function takes the name of the application as a string parameter.

We then call the **getOrCreate()** method, chained with the dot operator **'.'**, to create the application. With this code, we create our application as "App".

We are free to give any name to the application we create. Don't forget to create a session, because we cannot continue without one.

Code:

import pyspark 
from pyspark.sql import SparkSession 

session = SparkSession.builder.appName('App').getOrCreate() # creating an app

Create a session

Different ways to print data using PySpark

Now that you're ready, let's get to the real deal: the different ways to print data with PySpark.

1. Print raw data

In this case, we work with a raw data set. In the AI (artificial intelligence) domain, we call a collection of data a data set.

It comes in various forms, such as Excel files, comma-separated value files, text files, or server document models. So keep track of which file format we are using to print the raw data.

Here, we use a data set with the **.csv** extension. The session's **read** attribute has various functions for reading files.

These functions are usually named after the file types they read, so we use the csv() function for our data set. We store everything in the data variable.


data = session.read.csv('Datasets/titanic.csv') # read the CSV file into a data frame
data # calling the variable

By default, PySpark reads all the data as strings. So when we call our data variable, it returns each column, identified by its number, with string as its type.

To print the raw data, call the **show()** function on the data variable using the dot operator '.'.

data.show()

Reading data sets

2. Format data

Formatting data in PySpark means displaying the columns of the data set with their appropriate data types. To display the column headers, we use the option() function. This function takes two string arguments:

  1. key
  2. value

For the key parameter we pass header, and for the value we pass true. As a result, the header row is scanned and displayed in place of the column numbers.

The most important step is detecting the data type of each column. To do so, we enable the inferSchema parameter of the csv() function we used earlier to read the data set. It is a Boolean parameter, so we set it to True to enable it. We chain each function with the dot operator.

Code:

data = session.read.option('header', 'true').csv('Datasets/titanic.csv', inferSchema=True)

Display data in the correct format


We can see that the headers and the appropriate data types are now visible.

3. Show the top 20-30 rows

To display the top 20-30 rows, we need only one line of code. The **show()** function does this for us. If the data set is large, it displays the first 20 rows by default. However, we can make it display as many rows as we like by passing that number as a parameter to **show()**.

data.show() # to display top 20 rows

Show the top 20 rows

data.show(30) # to display top 30 rows

Show the top 30 rows

We can use the **head()** function to achieve the same thing. This function provides access to the top rows of the data set. It takes the number of rows as an argument and returns that many rows. For example, to get the top 10 rows:

data.head(10)

However, the result is returned as a list of Row objects rather than a formatted table. Unfortunately, this makes the output hard to read for large data sets with many rows and columns. The output below shows this.

Using the head() method to print the top 10 rows

4. Show the bottom 20-30 rows

This is also an easy task. The tail() function does it for us. Call it on the data frame variable and pass the number of rows we want as a parameter. For example, to show the last 20 rows, we write:

data.tail(20)

Show the bottom 20 rows

Similarly, the result is a list of rows rather than a table, so it is hard to read when the data set is large.

5. Sort the data before displaying

Sorting is the process of putting things in a proper order. The order can be ascending (from small to large) or descending (from large to small). Sorting plays an important role when viewing data points in sequence. The columns in a data frame can be of various types, but the two main ones are integers and strings.

  1. Integers are sorted by numeric value.
  2. Strings are sorted alphabetically.

The sort() function in PySpark serves this purpose. It accepts a single column or multiple columns as parameters. Let's try it on our data set by sorting the PassengerId column. There are two functions for this:

  1. sort()
  2. orderBy()

Sort in ascending order

data = data.sort('PassengerId')

Sort a single column

The PassengerId column has been sorted. The code arranges all the elements in ascending order. Here we sorted only a single column. To sort by multiple columns, we pass them to sort() one by one, separated by commas.

data = data.sort('Name', 'Fare')

Sort multiple columns

Sort in descending order

This is where the **orderBy()** function comes in. It lets us sort our data in descending order. In PySpark, orderBy() is an alias for sort(), so sort() accepts the same column expressions.

The code stays the same, except that inside orderBy() we reference each column through the data frame with the dot operator and call the **desc()** function on it.

**desc()** sorts all the elements of those specific columns in descending order.

First, let's look at all the columns in the data set.

Code:


data.columns

List the records

In the following code, we sort by the Name and Fare columns. Name is a string column, so it is sorted in reverse alphabetical order. Fare is a number, so it is sorted from large to small.


data = data.orderBy(data.Name.desc(), data.Fare.desc())

Sort in descending order


So, this is how we use PySpark to print the contents of our data. Every piece of code is short and easy to understand, and it is enough to give us a working knowledge of these Spark functions. This environment is very powerful for big data and other industrial and technological fields.

copyright notice
author[Stone sword],Please bring the original link to reprint, thank you.
