Projects on Data Analytics Using R

@saranyaravikumar

  • Our dataset covers laptop prices in euros, along with a few other variables.

  • The data is visualized as a bar plot below. 812c1a3b-8f3b-4bb2-ba8d-a2019196276f-image.png

- We can conclude that Apple laptops have higher prices despite being heavy, while Acer and HP laptops have lower prices despite weighing almost the same as Apple laptops.

  • The ML algorithm used is supervised (Multiple Linear Regression) because the dataset has more than one input and only has one continuous numerical target.
    ml1.JPG
    ml2.JPG
    ml3.JPG
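
A multiple linear regression like the one described above could be sketched in base R as follows. The data frame, its column names (`price_euros`, `weight_kg`, `ram_gb`) and all values are made up for illustration; they are not the group's actual dataset.

```r
# Hypothetical laptop data; column names and values are illustrative only.
laptops <- data.frame(
  price_euros = c(1340, 899, 575, 2537, 1804, 400, 2140, 1158),
  weight_kg   = c(1.37, 1.34, 1.86, 1.83, 1.37, 2.10, 2.04, 1.34),
  ram_gb      = c(8, 8, 8, 16, 8, 4, 16, 8)
)

# Multiple linear regression: one continuous target, more than one input.
model <- lm(price_euros ~ weight_kg + ram_gb, data = laptops)
summary(model)

# Predict the price of a new (made-up) laptop.
new_laptop <- data.frame(weight_kg = 1.5, ram_gb = 8)
predicted <- predict(model, newdata = new_laptop)
print(predicted)
```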

4417d333-1c4b-448e-966a-4f33c84252c0-image.png

Group name: HIJAU
Data: Car selling price prediction

  1. Our data is about car details from Car Dekho, a website that acts as a mediator between sellers and buyers. The details are the car's name, year of release, selling price, km driven, fuel, seller type, transmission and owner.
    Screenshot (59).png

  2. We visualized the data using a pie chart, bar graph, line graph, box plot and scatter plot.
    Screenshot (58).png

  3. The ML algorithm that we used is regression because the target column is continuous numerical data. The dataset has many inputs and only one output.
    Screenshot (56).png

group: HIJAU

We are using data on car details from Car Dekho, a website based in India that sells used cars. The data contains 8 variables.

For one of our analyses, we wanted to see the average selling price for each type of owner. Therefore, we used the group by keyword to group the owner column so that we could extract the mean for each owner type.

Screenshot (7).png
Screenshot (8).png
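
For readers who cannot see the screenshots, here is a minimal sketch of the same idea. It uses base R's `aggregate()` as a stand-in for the group-by approach mentioned above, and the `cars` data frame with its values is invented for illustration.

```r
# Hypothetical used-car data; values are illustrative only.
cars <- data.frame(
  owner = c("First Owner", "Second Owner", "First Owner",
            "Third Owner", "Second Owner", "First Owner"),
  selling_price = c(600000, 250000, 450000, 120000, 300000, 750000)
)

# Group by owner type and compute the mean selling price per group.
avg_price <- aggregate(selling_price ~ owner, data = cars, FUN = mean)

# Sort so that the owner type with the highest average comes first.
avg_price <- avg_price[order(-avg_price$selling_price), ]
print(avg_price)
```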

As the last picture shows, used cars with a first owner have the highest average selling price among all owner types.

We are using the KC House dataset. In this analysis, we cleaned and filtered the data and chose to focus on the price, zipcode (location of the property), number of bedrooms, size of the property (sqft_lot), and the view of the area.

df5.jpg

In the analysis, we classified each area (zipcode) as urban or rural.

price_category.jpg
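
One way to label areas as urban or rural is sketched below. The list of "urban" zipcodes is a hypothetical assumption, as are the sample rows; the group's actual classification rule is not shown in the post.

```r
# Hypothetical house data; values are illustrative only.
houses <- data.frame(
  zipcode  = c(98101, 98014, 98102, 98022, 98103),
  price    = c(850000, 420000, 910000, 380000, 780000),
  sqft_lot = c(3000, 45000, 2500, 60000, 4000)
)

# Assumption: these zipcodes count as urban; everything else is rural.
urban_zipcodes <- c(98101, 98102, 98103)
houses$area_type <- ifelse(houses$zipcode %in% urban_zipcodes, "Urban", "Rural")

# Compare mean price and mean lot size between the two categories.
summary_by_area <- aggregate(cbind(price, sqft_lot) ~ area_type,
                             data = houses, FUN = mean)
print(summary_by_area)
```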

  1. We found that houses in the urban category have higher prices than those in rural areas.

sqft.jpg

  2. However, the analysis also showed that properties in rural areas are much larger than those in urban areas.

sqft_price.jpg

  3. Based on these two factors (the size and the area), we found that larger houses (by sqft_lot) in rural areas have much lower prices than smaller houses in urban areas.

Further analysis can be found here: link text

Hi, we are from Iris Group. Our dataset is about the salaries of 30 employees in our company. After analyzing and exploring the data, we used a histogram to visualize it. For the ML algorithm, we chose Simple Linear Regression because our data consists of only one input and one output, and both columns are continuous numerical data.

To view the histogram:
download.png

To view the RShiny App:
Screenshot (239).png

Group members:
Noor Anis Nabilla Binti Ismail
Nursyafiqah Sharmeen Binti Hussin
Ummilia Balqis Binti Harun

WhatsApp Image 2022-11-04 at 7.35.15 PM.jpeg

Thank you for teaching us, Dr. Sara. We really enjoyed the course and learned a lot! 🙂

Group: Pfyzer
Our project title was "Covid-19 Total Case Prediction." The objective of our study is to analyze the total cases of Covid-19 based on total deaths, total recovered and active cases across all countries, so that we can predict Covid-19 cases in the future. Our data consists of 230 rows and 13 columns. The data frame for the study is shown below.

Capture6.PNG

We visualized the data using ggplot. For example, we used an ordered bar chart to analyze the relationship between country and total deaths. The chart below shows a few countries with their death counts.

Capture.PNG

For the Machine Learning (ML) algorithm, we used Multiple Linear Regression because the data has more than one numerical input.

Capture4.PNG
Capture5.PNG
Capture66.PNG

That's all I can share. My friends and I really enjoyed joining this programme because we were trained by professional trainers. Thank you so much, Dr. Sara! Stay safe always ♥ 😊

Hi! We are from Pfyzer Group. Here are our findings from analyzing real-time datasets of live Covid-19 cases.

  1. Our data is about predicting Covid-19 total cases. The total cases include the active cases, total deaths and total recovered all around the world. The data frame of 230 rows and 12 columns is shown below:

data frame covid.PNG

cont.df covid.PNG

  2. We visualized our data using a box plot and ggplot charts such as a bar chart, lollipop chart, line plot and scatter plot. Below is one example:

Lollipop Chart
lollipop chart.PNG

Observation: The United States of America (USA) has the highest number of recovered Covid-19 cases among the top ten countries analysed.

  3. The Machine Learning algorithm that we used to analyze our data is Multiple Linear Regression. We chose it because the data has more than one independent variable and one dependent variable. Put simply, the data frame has more than one input and one numerical output. The results are shown below:

summary.PNG

Last but not least, thank you for everything Dr. Sara! Take care and stay safe ❤

Dataset : full_data.csv

1. Understanding the data
As no description was provided with the dataset, we began by establishing a general understanding of it and identifying the input and output variables:

Screenshot 2022-11-06 at 11.29.01 PM.png

We pay attention to:

  • number of columns and their names -> what the data is about and how many variables we are going to analyze (preparing our minds for the degree of complexity we are going to handle 🥶).

  • data type of each column -> which columns should be the input variables, and which should be the target output. This also gives an idea of what kind of prediction should be performed: is it a regression or a classification problem?

  • number of rows -> how many samples we have

From this exercise:
We learned that this dataset has 4981 rows and 11 columns labeled as
gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke.

  • Thus, we assumed this dataset was from a 'health' sector and was sampled from 4981 patients.

  • We deduced that this dataset collects 10 parameters (input variables) on patients with and without stroke (1 output variable or target).

  • As such, we are going to predict stroke as a classification problem.

2. Data Cleaning
At this stage, we aimed to ensure our data is consistent across the dataset:

  • no missing data

  • appropriate data type and values (e.g. age, avg_glucose_level, bmi shouldn't have negative values)

  • consistent format (e.g. gender should be either 'Male' or 'Female' only, not a mixture of other representations such as 'F', 'male', etc.)
    self-explanatory labels

  • relevant (no unique values like ID number, names, etc)
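
The checks in this list could be sketched in base R roughly as follows. The small `patients` data frame is invented and deliberately contains the kinds of problems described above (a missing value, a negative age, inconsistent gender labels); it is not the actual dataset.

```r
# Hypothetical, deliberately messy sample to demonstrate the checks.
patients <- data.frame(
  gender = c("Male", "Female", "F", "male"),
  age    = c(67, 45, -3, 30),
  bmi    = c(36.6, NA, 28.1, 22.4)
)

# 1) Missing data: count NA values in every column.
missing_per_column <- colSums(is.na(patients))

# 2) Appropriate values: find rows where age is negative.
negative_age_rows <- which(patients$age < 0)

# 3) Consistent format: list gender labels other than the expected two.
unexpected_gender <- setdiff(unique(patients$gender), c("Male", "Female"))

print(missing_per_column)
print(negative_age_rows)
print(unexpected_gender)
```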

The given dataset is actually pretty clean and meets the above-mentioned criteria. The only amendment we made was changing the data type of hypertension, heart_disease, and stroke from integer to factor, as these variables tell whether the patient has the illness or not (yes/no).

Screenshot 2022-11-06 at 8.33.17 PM.png
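
A minimal sketch of this integer-to-factor conversion is shown below, using a tiny made-up data frame. The "No"/"Yes" labels are our own assumption for readability; the screenshot may use a different call or keep the 0/1 levels.

```r
# Hypothetical sample with the three 0/1 columns stored as integers.
stroke_data <- data.frame(
  hypertension  = c(0L, 1L, 0L),
  heart_disease = c(1L, 0L, 0L),
  stroke        = c(1L, 0L, 0L)
)

# Convert each binary column to a factor (assumed labels: No / Yes).
binary_cols <- c("hypertension", "heart_disease", "stroke")
stroke_data[binary_cols] <- lapply(
  stroke_data[binary_cols],
  factor, levels = c(0, 1), labels = c("No", "Yes")
)

str(stroke_data)
```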

3. Exploratory Data Analysis
This stage aims to provide descriptive (what happened) and diagnostic (why it happened) analysis. We approach the EDA by looking at categorical data, numerical data, and a combination of both to understand the relationships:

Categorical data
4eda7ee8-578c-45f3-b378-96db9f4a7a5e-image.png 65d9e125-a0a8-4974-ac2a-2c0de2ba14c7-image.png

Distribution of samples by each category:

  • most of the sampled patients are married, work in the private sector, and have neither hypertension nor heart disease.

  • the sampled patients are almost equally distributed by gender (male or female) and type of residence (rural or urban).

  • the majority of the sampled patients are non-smokers (either never smoked or formerly smoked).

Numerical data
5673697e-54e9-4c94-a363-833b5a4625c1-image.png

Range of each variable:

  • The sampled patients are aged between 0 and 85, with most between 40 and 60 or around 80 years old.

  • The sampled patients' bmi is approximately normally distributed with a center of ~30 (most of the patients have a weight issue).

  • The glucose level is skewed to the right with two obvious peaks at ~100 and ~200 (most of the patients have high blood sugar).

Understanding the relationship
Screenshot 2022-11-06 at 8.52.38 PM.png
Screenshot 2022-11-06 at 8.56.14 PM.png

The graphs show the number of patients with and without stroke for each factor.

Here we can see that those who have had a stroke make up only a small portion compared to those without stroke. In other words, we have imbalanced data with a significant lack of data for patients with stroke.
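
A quick way to quantify this kind of class imbalance is to tabulate the target column. The 95/5 split below is an invented example, not the real counts from the dataset.

```r
# Hypothetical target column: 95 patients without stroke, 5 with stroke.
stroke <- factor(c(rep("No", 95), rep("Yes", 5)))

# Absolute counts and relative proportions per class.
counts <- table(stroke)
proportions <- prop.table(counts)

print(counts)
print(round(proportions, 2))
```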

4768fff5-7b5b-4f86-a801-f429006c19d6-image.png 424e61f0-de4f-4c01-b931-20df83541cb9-image.png dabe262d-d3f4-480c-854e-9fe481e94213-image.png
Graphs show the proportion of each factor for patients with stroke:

  • Gender and residence type do not appear to make much difference in the occurrence of strokes.

  • Those with hypertension, heart disease, or those who have been married have a much higher proportion of their populations having had a stroke.

  • In terms of work type, Children have very low occurrences of strokes. There is little difference in the proportions of those who work in government and those who work in the private sector. Self-employed have a higher proportion of having strokes than other sectors.

  • Current smokers have a higher proportion of their population having had a stroke than those who have never smoked. Former smokers have a higher occurrence of strokes than current smokers. Those with unknown smoking have a low occurrence of strokes.

  • Among age groups, the elderly have the highest proportion of strokes, followed by adults. Interestingly, the occurrence of stroke in children is higher than in young adults.

  • The proportion of stroke patients who are overweight or obese is high compared to other bmi classes, with overweight slightly higher than obese. Those who have strokes are very seldom underweight.

Next, we analyze each factor by age, average glucose level, and bmi.

d01249dc-ca55-4b95-b8f5-32706e03e431-image.png aeb2ec2b-d634-48b0-98ec-519aed312765-image.png 003b041e-a85c-44e5-951b-9e6473cc886c-image.png

  • For all levels in each factor those who have had a stroke are older.

  • Those with hypertension and heart disease are older than those who do not. Self-employed are also older than the other types of work.

  • Those who had a stroke and smoke are younger than those who quit or never smoked (but still had a stroke).

14b6081f-79a2-45f9-8235-8ea024088840-image.png a843567c-4acb-4157-951e-12f2a2f0c0c2-image.png 0e7a200d-f712-4c8e-bbce-419d7d2e9e6e-image.png
The average glucose level is right skewed.

  • The IQR tends to go higher for those who had a stroke. Those with hypertension and heart disease have higher glucose levels regardless of having a stroke or not.

d987c8a9-e4bf-40e4-8687-c73caa7d6fc2-image.png d70db91a-50f5-4197-971d-92faadfc6c8c-image.png 985a45d9-41df-43d3-b6ff-93fce0ab5492-image.png
The graphs show that there is not much difference in the BMI of those who had a stroke and those who had not.

dd8ca2e9-4939-46ed-b4ba-5706237be2ea-image.png
As age increases, the number of strokes increases.

The distribution of average glucose level is bimodal for both the stroke and no-stroke populations, with peaks at the same values. However, the density of strokes at higher glucose levels is higher than the density of no strokes at the same level.

There is no difference in the distribution of bmi between those who have had a stroke and those who have not.

4. Machine Learning

We split the dataset into training and testing sets in an 80:20 ratio.
Screenshot 2022-11-06 at 10.26.52 PM.png
21f54803-ea87-450c-b519-f37f6928d645-image.png
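
An 80:20 split like the one in the screenshots can be sketched with base R's `sample()`. The data frame here is randomly generated for illustration, and the seed value is arbitrary; the group's own code may use a different helper.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical dataset of 100 rows; values are randomly generated.
n <- 100
dataset <- data.frame(
  age    = round(runif(n, 1, 85)),
  stroke = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.9, 0.1)))
)

# Randomly pick 80% of the row indices for training; the rest is for testing.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dataset[train_idx, ]
test  <- dataset[-train_idx, ]

cat("Training rows:", nrow(train), "| Testing rows:", nrow(test), "\n")
```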

1) Logistic regression
Screenshot 2022-11-06 at 10.32.40 PM.png
Screenshot 2022-11-06 at 10.33.43 PM.png
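
As a rough illustration of the logistic regression step, the sketch below fits `glm()` with a binomial family on simulated data. The predictors chosen, the simulated relationship and the 0.5 cut-off are all assumptions for the example, not the settings used in the screenshots.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data: stroke risk rises with age and glucose (assumed relationship).
n <- 200
age <- runif(n, 20, 85)
avg_glucose_level <- runif(n, 60, 250)
p_stroke <- plogis(-8 + 0.08 * age + 0.01 * avg_glucose_level)
patients <- data.frame(
  age, avg_glucose_level,
  stroke = factor(rbinom(n, 1, p_stroke), levels = c(0, 1))
)

# Logistic regression: binary target, numeric inputs.
fit <- glm(stroke ~ age + avg_glucose_level, data = patients, family = binomial)

# Predicted probabilities, then class labels using an assumed 0.5 cut-off.
prob <- predict(fit, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix of actual vs predicted classes.
conf_matrix <- table(actual = patients$stroke, predicted = pred)
print(conf_matrix)
```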
2) Decision tree
Screenshot 2022-11-06 at 10.35.04 PM.png
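
A classification tree could be fitted along these lines with the `rpart` package (shipped with standard R installations). The data and the rule generating the labels are invented for the example; the actual tree in the screenshot is trained on the stroke dataset.

```r
library(rpart)
set.seed(2)  # arbitrary seed for reproducibility

# Simulated data: label is "Yes" only for older patients with hypertension
# (an assumed rule, used purely to give the tree something to learn).
n <- 300
age <- round(runif(n, 1, 85))
hypertension <- factor(sample(c("No", "Yes"), n, replace = TRUE))
stroke <- factor(ifelse(age > 60 & hypertension == "Yes", "Yes", "No"))
patients <- data.frame(age, hypertension, stroke)

# Fit a classification tree (method = "class" for a categorical target).
tree <- rpart(stroke ~ age + hypertension, data = patients, method = "class")

# Predict class labels and measure accuracy on the training data.
pred <- predict(tree, newdata = patients, type = "class")
accuracy <- mean(pred == patients$stroke)
print(accuracy)
```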
3) KNN

Screenshot 2022-11-06 at 10.48.04 PM.png
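
For KNN, one possible sketch uses `knn()` from the `class` package (also shipped with standard R installations). The features, the labelling rule and `k = 5` are assumptions for illustration. Features are scaled with the training set's mean and standard deviation because KNN is distance-based.

```r
library(class)
set.seed(3)  # arbitrary seed for reproducibility

# Simulated numeric features and an assumed labelling rule.
n <- 100
features <- data.frame(age = runif(n, 1, 85), bmi = runif(n, 15, 45))
labels <- factor(ifelse(features$age > 55, "Yes", "No"))

# 80:20 split of the row indices.
train_idx <- sample(seq_len(n), size = 80)

# Scale training features, then apply the same centering/scaling to the test set.
train_x <- scale(features[train_idx, ])
test_x  <- scale(features[-train_idx, ],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Classify each test row by majority vote among its 5 nearest neighbours.
pred <- knn(train = train_x, test = test_x, cl = labels[train_idx], k = 5)
accuracy <- mean(pred == labels[-train_idx])
print(accuracy)
```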

Wow this is very impressive work everyone! #ratutech!!

Group: HADA
Dataset: exams

The data we are working with is exams.csv, which contains score data for completed tests.

Understanding the data:

From exploring the data, we can describe the columns as follows:
• gender : gender of the student
• race.ethnicity : the student's ethnicity group, from Group A to Group F
• parental.level.of.education : the education level of the student's parents
• lunch : the type of lunch taken by the student
• test.preparation.course : the status of the student's test preparation course
• math.score : the mathematics test score of the student
• reading.score : the reading test score of the student
• writing.score : the writing test score of the student
Screenshot 2022-11-07 211811.png

e5c0cff3-5a6c-4730-a153-a077c667c359-image.png

Data Cleansing & Analysis:
We filtered the data for rows where math.score is more than 80.
Screenshot 2022-11-07 212148.png
Screenshot 2022-11-07 212340.png
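
The filter described above can be sketched with base R's `subset()`. The four sample rows are made up, and the screenshots may use a different function to achieve the same result.

```r
# Hypothetical exam scores; values are illustrative only.
exams <- data.frame(
  gender        = c("female", "male", "female", "male"),
  math.score    = c(72, 85, 91, 80),
  reading.score = c(72, 80, 95, 78),
  writing.score = c(74, 79, 93, 75)
)

# Keep only rows with math.score strictly greater than 80.
high_math <- subset(exams, math.score > 80)
print(high_math)
```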

Next, we created a new column with each student's average score and a category based on that average:
Screenshot 2022-11-07 212705.png
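
One way to build such columns is `rowMeans()` followed by `cut()`. The category names and the thresholds (50 and 80) below are hypothetical; the post does not state the group's actual cut-offs.

```r
# Hypothetical exam scores; values are illustrative only.
exams <- data.frame(
  math.score    = c(72, 45, 90),
  reading.score = c(72, 50, 95),
  writing.score = c(74, 40, 88)
)

# Average of the three test scores for each student.
score_cols <- c("math.score", "reading.score", "writing.score")
exams$average <- rowMeans(exams[, score_cols])

# Assumed categories: [0, 50] Low, (50, 80] Medium, (80, 100] High.
exams$category <- cut(
  exams$average,
  breaks = c(0, 50, 80, 100),
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)
print(exams)
```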

Machine Learning:
After analyzing and exploring the data, we visualized it:
histogram.png

graph gender.png

boxplot.png

math score.png

Group name: MOMO

Our data is a large data set of different types of cars across the American market. We are required to model the price of cars with the available independent variables.
Screenshot 2022-11-08 105552.png

From the data, we can see that horsepower and engine size positively affect the price of the car. As horsepower and engine size increase, the price of the car also increases.
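
A positive relationship like this can be checked numerically with `cor()`. The five rows below are invented to mimic the described pattern and do not come from the group's dataset.

```r
# Hypothetical car data; values are illustrative only.
cars <- data.frame(
  horsepower  = c(70, 100, 150, 200, 250),
  engine_size = c(90, 110, 150, 200, 300),
  price       = c(7000, 10000, 17000, 25000, 36000)
)

# Pearson correlation of each input with price; values near +1 mean
# the two variables tend to increase together.
cor_hp     <- cor(cars$horsepower, cars$price)
cor_engine <- cor(cars$engine_size, cars$price)

print(c(horsepower = cor_hp, engine_size = cor_engine))
```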
After that, for machine learning, we used multiple linear regression because our data has more than one input and a numeric output (target).
Screenshot 2022-11-08 105540.png

The above figure is our histogram chart that was created using R Shiny.

Our data is about treatments using 5 types of drugs.

To analyse the data, we used the ggplot2 package available in the R programming language. The data is used to find which drug is better for patients with different blood pressure and cholesterol levels.

The histogram below shows the relationship between age and the types of drugs. Our finding is that the drugs have almost similar proportions between sexes.

a38ed804-2e88-40b1-9af6-78dea491d03d-image.png

The ML algorithm used is supervised (Classification-Decision Tree) because the target is categorical.

37c85ed3-0fbf-4388-8022-f730e727b0a0-image.png

We can conclude that Drug A plays a more important role than the other drugs.

Here, we also plot a histogram of the ages of the patients in this observation. The histogram is plotted using the R Shiny package.

aa1ac719-78d1-4bad-988a-028fce72d21e-Capture.PNG

Hi all. Here I want to share our experience and some findings from the R Programming course taught by our easy-to-understand trainer, @saranyaravikumar. Below is some of our code:
buang column.PNG
tukar factor to number.PNG kira mean for double sahaja.PNG

Next are a few snapshots from our findings:
histogram.PNG boxplot.PNG bar plot.PNG

We are really grateful and enjoyed our R Programming lessons! We hope to retain what we have learnt for future reference.