Apache Spark usage for Healthcare Analytics
To put more light on how to achieve analytical queries using PySpark i have taken the reference data from kaglle. This data set primarily consist of patient profile and the health camp attended by the patient.
Below the code walk-through line wise
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType
#Data Load from csv
Patient_Profile_df = spark.read.option(“inferSchema”,True).option(“header”,True).csv(“dbfs:/FileStore/tables/Patient_Profile.csv”)
Health_Camp_Detail_df = spark.read.option(“inferSchema”,True).option(“header”,True).csv(“dbfs:/FileStore/tables/Health_Camp_Detail.csv”)
First_Health_Camp_Attended_df = spark.read.option(“inferSchema”,True).option(“header”,True).csv(“dbfs:/FileStore/tables/First_Health_Camp_Attended.csv”)
Second_Health_Camp_Attended_df = spark.read.option(“inferSchema”,True).option(“header”,True).csv(“dbfs:/FileStore/tables/Second_Health_Camp_Attended.csv”)
Now once the data is loaded you can view/analyse the data using show() function on the respective data-frame to identify the data errors or irregularity.
The patient profile data consists of null values,duplicates and numerical values in string format.Now we will do the cleansing and data preparation activities for further analysis.
Patient_Profile_df = Patient_Profile_df.na.fill({‘City_Type’: ‘NA’,’Employer_Category’: ‘NA’})
Patient_Profile_df = Patient_Profile_df.withColumn(‘Education_Score’, regexp_replace(‘Education_Score’, ‘None’, “0”))
Patient_Profile_df = Patient_Profile_df.withColumn(‘Income’, regexp_replace(‘Income’, ‘None’, “0”))
Patient_Profile_df = Patient_Profile_df.withColumn(‘Age’, regexp_replace(‘Age’, ‘None’, “0”))
Patient_Profile_df = Patient_Profile_df.dropDuplicates()
Patient_Profile_df.show(5)

Also we will change the datatypes of string to integer for few columns such as age,income etc.
Patient_Profile_df = Patient_Profile_df.withColumn(“Age”, Patient_Profile_df[“Age”].cast(IntegerType()))
Patient_Profile_df = Patient_Profile_df.withColumn(“Education_Score”, Patient_Profile_df[“Education_Score”].cast(IntegerType()))
Patient_Profile_df = Patient_Profile_df.withColumn(“Age”, Patient_Profile_df[“Age”].cast(IntegerType()))
Patient_Profile_df = Patient_Profile_df.withColumn(“Income”, Patient_Profile_df[“Income”].cast(IntegerType()))
Now get the score first health score average of the patient using the following code.
First_Health_Camp_Attended_df = First_Health_Camp_Attended_df.groupBy(“Patient_ID”).agg({“Health_Score”:”avg”})
First_Health_Camp_Attended_df = First_Health_Camp_Attended_df.withColumnRenamed(“avg(Health_Score)”, “Health_Score”)
First_Health_Camp_Attended_df.sort(“Patient_ID”).show()

Perform same operation on the second health camp data

Now join the data for both the health camp
result_df = First_Health_Camp_Attended_df.join(Second_Health_Camp_Attended_df, “Patient_ID”)
result_df.sort(“Patient_ID”).show()
result_df = result_df.select(col(“Patient_ID”),((col(“Health_Score”) + col(“Health_Score_second”)) / lit(2)).alias(“mean”))
result_df.sort(“Patient_ID”).show()
Result 1

Result 2

Now we will combine the Result 2 data i.e result_df with patient profile data.
The below code is to identify the list of patient which are active on social media and have a health score more than .5
#print list of all those patients which are active on social media and have a health score more than .5
#Filters rows using the given condition. where() is an alias for filter().
final_result_df = Patient_Profile_df.join(result_df, on=[“Patient_ID”],how=”inner”)
final_result_df.filter((col(“Online_Follower”) >= “1”) & (col(“LinkedIn_Shared”) >= “1”) & (col(“Twitter_Shared”) >= “1”) & (col(“Facebook_Shared”) >= “1”) & (col(“Education_Score”) >= “1”) & (col(“Age”) <= “50”) & (col(“mean”) >= “.5”)).show(5)
The output is given below

Now to identify list of all those patients which are not health continuous as these people spend more time on social media sites are nuisance value and burden on society.
#print list of all those patients which are not health continuous as these people spend more time on scocial media sites are nusance value and burden on society.
final_result_df = Patient_Profile_df.join(result_df, on=[“Patient_ID”],how=”inner”)
final_result_df.filter((col(“Facebook_Shared”) >= “1”) &(col(“mean”) <= “.5”)).show(5)
The output for above code is given below
