Loading and Processing Images from HDFS into RGB arrays with PySpark
Link to the code:
In the realm of big data processing, images often hold valuable information. PySpark, a powerful framework built on Apache Spark, empowers you to distribute and process massive image datasets efficiently.
But handling images stored in HDFS (Hadoop Distributed File System) with PySpark is not that easy!
This tutorial delves into leveraging PySpark to load images stored in HDFS (Hadoop Distributed File System) and convert them into RGB (Red, Green, Blue) arrays – a fundamental representation for image manipulation and analysis.
Why Use PySpark for Image Loading?
Distributed Processing: HDFS can house vast image collections. PySpark excels at parallelizing image processing tasks across multiple machines, significantly accelerating the process.
Scalability: As your image dataset expands, PySpark seamlessly scales up its processing power to maintain performance.
Flexibility: PySpark integrates with various image processing libraries like OpenCV and scikit-image, allowing for diverse image manipulation and analysis tasks.
For this tutorial, I'll work with the images below; they have already been uploaded to HDFS at the path "user1/images".
1 - Load Libraries
For image processing, we use Pillow (PIL), the Python Imaging Library for working with images.
If you don't have it installed on your system yet, just install it using pip:
! pip install Pillow
So, let's import the necessary libraries and create a Spark session:
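A minimal sketch of those imports and the session setup could look like this (the app name "hdfs-image-loading" is just an illustrative choice, not from the original code):

import io
import numpy as np
from PIL import Image
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the app name is arbitrary
spark = SparkSession.builder.appName("hdfs-image-loading").getOrCreate()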
Now, define the data_path and load the images into a DataFrame. (Note: the format will be "binaryFile"; when loading as binary files, the column 'content' holds the binary data of each image, and that is what we will convert into a NumPy array.)
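A sketch of that step, assuming the images sit at the "user1/images" path mentioned above (depending on your cluster configuration, a full URI such as "hdfs:///user1/images" may be needed):

# Path from the tutorial; adjust the URI prefix to match your HDFS setup
data_path = "user1/images"

# Read every file in the directory as binary; the 'content' column
# holds the raw bytes of each image
df = spark.read.format("binaryFile").load(data_path)

df.printSchema()  # columns: path, modificationTime, length, content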
Now, create a function to convert each 'content' value from the RDD into an image and then into a NumPy array:
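A sketch of such a function (the name convert_to_rgb_array is my own, chosen for illustration):

def convert_to_rgb_array(row):
    # 'content' holds the binary data of the image for this row
    content = row["content"]
    # Wrap the bytes in a file-like object and open them as a PIL image
    image = Image.open(io.BytesIO(content))
    # Convert to RGB so every image ends up with three channels (R, G, B)
    rgb_image = image.convert("RGB")
    # Return the pixel values as a NumPy array of shape (height, width, 3)
    return np.array(rgb_image)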
Function Breakdown:
Content: This variable stores the binary data representing the image for each row.
Image Conversion: Using the "io" module, we convert the binary content into a Python Imaging Library (PIL) image object.
RGB Conversion: The "convert" method of the PIL image object is used to transform the image into RGB format, resulting in a three-channel array representing the red, green, and blue components of each pixel.
Return Value: The function returns a NumPy array containing the RGB values.
Creating the arrays
A PySpark RDD provides two methods for executing a function over each of its elements: rdd.map and rdd.foreach.
rdd.map: Applies a function to each element of an RDD and returns a new RDD with the transformed elements.
rdd.foreach: Applies a function to each element of an RDD, but it doesn't return any result and doesn't create a new RDD.
For this case, rdd.map will be used, as sketched below.
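Roughly, applying the function might look like this (collect() is only reasonable for a small demo set, since it brings every array back to the driver):

# Apply the conversion to every row of the DataFrame's underlying RDD
rgb_arrays_rdd = df.rdd.map(convert_to_rgb_array)

# For a small dataset, gather the resulting NumPy arrays on the driver
rgb_arrays = rgb_arrays_rdd.collect()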
Now that the RGB arrays have been created, we can print them, plot the images, and run whatever analysis or deep learning models we want.
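For example (matplotlib is an extra dependency not mentioned above, used here only to display the results):

import matplotlib.pyplot as plt

for rgb_array in rgb_arrays:
    print(rgb_array.shape)  # e.g. (height, width, 3)
    plt.imshow(rgb_array)
    plt.axis("off")
    plt.show()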
Thank you for following along with this tutorial on loading images from HDFS into RGB arrays using PySpark!
I hope you found it informative and gained valuable insights into this powerful technique.