Author: Jose Carlos Molano de Oro
University: Pontificia Universidad Javeriana
Course: Multivariate Analysis
Semester: 2022-3
Professor: Marisol Garcia Peña
Author Email: jose_molano@javeriana.edu.co
Professor Email: marisolgarcia@javeriana.edu.co
The dataset contains medical data for 155 people with acute or chronic hepatitis and gives the values of 20 variables for each person that, taken together, could predict whether a patient will die or recover from the disease (Diaconis & Efron, 1983).
Hepatitis means inflammation of the liver. The liver is a vital organ that processes nutrients, filters the blood, and fights infections. When the liver is inflamed or damaged, its function can be affected. Heavy alcohol use, toxins, some medications, and certain medical conditions can cause hepatitis. However, hepatitis is often caused by a virus.
Steroids are a man-made version of chemicals, known as hormones, that are made naturally in the human body. Steroids are designed to act like these hormones to reduce inflammation.
Fatigue is a feeling of constant tiredness or weakness and can be physical, mental or a combination of both. It can affect anyone, and most adults will experience fatigue at some point in their life.
Malaise is described as any of the following:
a feeling of overall weakness
a feeling of discomfort
a feeling like you have an illness
simply not feeling well
Anorexia is a potentially life-threatening eating disorder that is characterized by self-starvation and excessive weight loss.
The spleen is a small organ inside your left rib cage, just above the stomach. It’s part of the lymphatic system (which is part of the immune system). The spleen stores and filters blood and makes white blood cells that protect you from infection.
Spider angioma is an abnormal collection of blood vessels near the surface of the skin.
Ascites is a buildup of fluid in your abdomen. It often occurs as a result of cirrhosis, a liver disease.
Varices are large or swollen blood vessels, which can be located around the esophagus. The most common cause of esophageal varices is scarring of the liver.
Bilirubin is made during the normal process of breaking down red blood cells. It is a yellowish substance found in bile, a fluid in your liver. This fluid helps digest food.
Alkaline phosphatase is a kind of enzyme found in your body. Enzymes are proteins that help chemical reactions happen. For instance, they can break big molecules down into smaller parts, or they can help smaller molecules join together to form bigger structures.
SGOT is an enzyme found in the liver, heart, and other tissues. A high level of SGOT released into the blood may be a sign of liver or heart damage, cancer, or other diseases. It is also called aspartate transaminase or serum glutamic-oxaloacetic transaminase.
Albumin is a protein made by the liver. A serum albumin test measures the amount of this protein in the clear liquid portion of the blood.
Prothrombin is a protein produced by your liver. It is one of many factors in your blood that help it to clot appropriately.
Sources
Unknown
Donor: G.Gong (Carnegie-Mellon University)
via Bojan Cestnik
Jozef Stefan Institute
Jamova 39
61000 Ljubljana
Yugoslavia (tel.: (38)(+61) 214-399 ext.287)
Date: November, 1988
Past Usage
Diaconis, P. & Efron, B. (1983). Computer-Intensive Methods in Statistics. Scientific American, Volume 248. – Gail Gong reported an 80% classification accuracy
Cestnik, G., Kononenko, I., & Bratko, I. (1987). Assistant-86: A Knowledge-Elicitation Tool for Sophisticated Users. In I. Bratko & N. Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press. – Assistant-86: 83% accuracy
Number of Instances: 155
Number of Attributes: 20 (including the class attribute)
Attribute information
Class: DIE, LIVE
AGE: 10, 20, 30, 40, 50, 60, 70, 80
SEX: male, female
STEROID: no, yes
ANTIVIRALS: no, yes
FATIGUE: no, yes
MALAISE: no, yes
ANOREXIA: no, yes
LIVER BIG: no, yes
LIVER FIRM: no, yes
SPLEEN PALPABLE: no, yes
SPIDERS: no, yes
ASCITES: no, yes
VARICES: no, yes
BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
ALK PHOSPHATE: 33, 80, 120, 160, 200, 250
SGOT: 13, 100, 200, 300, 400, 500
ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90
HISTOLOGY: no, yes
Missing Attribute Values: (indicated by “?”)
Class Information
Die: 32
Live: 123
library(DT)
library(reactablefmtr)
# Note: the raw file has no header row, so header = TRUE consumes the first record as column names (154 of the 155 records remain)
hepatitis<-read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data"),header = T)
reactable(hepatitis,rownames = TRUE)
Since the column names read in do not correspond to the attributes of the dataset, we rename the columns using the attribute information above. Likewise, the coded values are replaced with their labels depending on the attribute type (e.g. Class: Die, Live), and the "?" entries are converted to NA.
colnames(hepatitis)<-c("Class","Age","Sex","Steroid","Antivirals","Fatigue","Malaise","Anorexia","Liver Big","Liver Firm","Spleen Palpable","Spiders","Ascites","Varices","Bilirubin","ALK Phosphate","SGOT","Albumin","Protime","Histology")
hepatitis["Class"][hepatitis["Class"]==1]<-"Die"
hepatitis["Class"][hepatitis["Class"]==2]<-"Live"
hepatitis["Sex"][hepatitis["Sex"]==1]<-"Male"
hepatitis["Sex"][hepatitis["Sex"]==2]<-"Female"
hepatitis[hepatitis==1]<-"No"
hepatitis[hepatitis==2]<-"Yes"
hepatitis[hepatitis=="?"]<-NA
reactable(hepatitis,rownames = TRUE)
summary(hepatitis)
## Class Age Sex Steroid
## Length:154 Min. : 7.00 Length:154 Length:154
## Class :character 1st Qu.:32.00 Class :character Class :character
## Mode :character Median :39.00 Mode :character Mode :character
## Mean :41.27
## 3rd Qu.:50.00
## Max. :78.00
## Antivirals Fatigue Malaise Anorexia
## Length:154 Length:154 Length:154 Length:154
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Liver Big Liver Firm Spleen Palpable Spiders
## Length:154 Length:154 Length:154 Length:154
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Ascites Varices Bilirubin ALK Phosphate
## Length:154 Length:154 Length:154 Length:154
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## SGOT Albumin Protime Histology
## Length:154 Length:154 Length:154 Length:154
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
Note that after the recoding all columns are of character type. We convert the categorical columns to factors and the laboratory measurements back to numeric variables.
hepatitis$Class<-as.factor(unlist(hepatitis$Class))
hepatitis$Age<-as.factor(unlist(hepatitis$Age))
hepatitis$Sex<-as.factor(unlist(hepatitis$Sex))
hepatitis$Steroid<-as.factor(unlist(hepatitis$Steroid))
hepatitis$Antivirals<-as.factor(unlist(hepatitis$Antivirals))
hepatitis$Fatigue<-as.factor(unlist(hepatitis$Fatigue))
hepatitis$Malaise<-as.factor(unlist(hepatitis$Malaise))
hepatitis$Anorexia<-as.factor(unlist(hepatitis$Anorexia))
hepatitis$`Liver Big`<-as.factor(unlist(hepatitis$`Liver Big`))
hepatitis$`Liver Firm`<-as.factor(unlist(hepatitis$`Liver Firm`))
hepatitis$`Spleen Palpable`<-as.factor(unlist(hepatitis$`Spleen Palpable`))
hepatitis$Spiders<-as.factor(unlist(hepatitis$Spiders))
hepatitis$Ascites<-as.factor(unlist(hepatitis$Ascites))
hepatitis$Varices<-as.factor(unlist(hepatitis$Varices))
hepatitis$Bilirubin<-as.numeric(unlist(hepatitis$Bilirubin))
hepatitis$`ALK Phosphate`<-as.numeric(unlist(hepatitis$`ALK Phosphate`))
hepatitis$SGOT<-as.numeric(unlist(hepatitis$SGOT))
hepatitis$Albumin<-as.numeric(unlist(hepatitis$Albumin))
hepatitis$Protime<-as.numeric(unlist(hepatitis$Protime))
hepatitis$Histology<-as.factor(unlist(hepatitis$Histology))
summary(hepatitis)
## Class Age Sex Steroid Antivirals Fatigue
## Die : 32 34 : 8 Female: 15 No :75 No : 24 No :100
## Live:122 38 : 8 Male :139 Yes :78 Yes:130 Yes : 53
## 30 : 7 NA's: 1 NA's: 1
## 36 : 7
## 39 : 6
## 50 : 6
## (Other):112
## Malaise Anorexia Liver Big Liver Firm Spleen Palpable Spiders
## No :61 No : 32 No : 24 No :60 No : 30 No :51
## Yes :92 Yes :121 Yes :120 Yes :83 Yes :119 Yes :98
## NA's: 1 NA's: 1 NA's: 10 NA's:11 NA's: 5 NA's: 5
##
##
##
##
## Ascites Varices Bilirubin ALK Phosphate SGOT
## No : 20 No : 18 Min. :0.30 Min. : 26.0 Min. : 14.00
## Yes :129 Yes :131 1st Qu.:0.70 1st Qu.: 74.0 1st Qu.: 32.25
## NA's: 5 NA's: 5 Median :1.00 Median : 85.0 Median : 58.00
## Mean :1.43 Mean :105.5 Mean : 86.35
## 3rd Qu.:1.50 3rd Qu.:133.0 3rd Qu.:100.75
## Max. :8.00 Max. :295.0 Max. :648.00
## NA's :6 NA's :29 NA's :4
## Albumin Protime Histology
## Min. :2.100 Min. : 0.00 No :84
## 1st Qu.:3.400 1st Qu.: 46.00 Yes:70
## Median :4.000 Median : 61.00
## Mean :3.816 Mean : 61.85
## 3rd Qu.:4.200 3rd Qu.: 76.25
## Max. :6.400 Max. :100.00
## NA's :16 NA's :66
To deal with missing values in the numeric columns of the dataset, each missing value is replaced by the median of its column.
library(dplyr)  # provides the pipe and mutate_all()
hepatitis[,15:19]<-hepatitis[,15:19] %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
summary(hepatitis)
## Class Age Sex Steroid Antivirals Fatigue
## Die : 32 34 : 8 Female: 15 No :75 No : 24 No :100
## Live:122 38 : 8 Male :139 Yes :78 Yes:130 Yes : 53
## 30 : 7 NA's: 1 NA's: 1
## 36 : 7
## 39 : 6
## 50 : 6
## (Other):112
## Malaise Anorexia Liver Big Liver Firm Spleen Palpable Spiders
## No :61 No : 32 No : 24 No :60 No : 30 No :51
## Yes :92 Yes :121 Yes :120 Yes :83 Yes :119 Yes :98
## NA's: 1 NA's: 1 NA's: 10 NA's:11 NA's: 5 NA's: 5
##
##
##
##
## Ascites Varices Bilirubin ALK Phosphate SGOT
## No : 20 No : 18 Min. :0.300 Min. : 26.0 Min. : 14.00
## Yes :129 Yes :131 1st Qu.:0.800 1st Qu.: 78.0 1st Qu.: 33.00
## NA's: 5 NA's: 5 Median :1.000 Median : 85.0 Median : 58.00
## Mean :1.414 Mean :101.6 Mean : 85.61
## 3rd Qu.:1.500 3rd Qu.:119.8 3rd Qu.: 99.50
## Max. :8.000 Max. :295.0 Max. :648.00
##
## Albumin Protime Histology
## Min. :2.100 Min. : 0.00 No :84
## 1st Qu.:3.500 1st Qu.: 57.00 Yes:70
## Median :4.000 Median : 61.00
## Mean :3.835 Mean : 61.49
## 3rd Qu.:4.200 3rd Qu.: 65.50
## Max. :6.400 Max. :100.00
##
This dataset contains the monthly number of reported arrests in the US for various offenses reported by participating law enforcement agencies. The arrests are by offense and broken down by age and sex or age and race. Not all agencies report race and/or ethnicity for arrests but they must report age and sex. Note that only agencies that have reported arrests for 12 months of the year are represented in the annual counts that are included in the database. Download this dataset to see totals of reported arrests for the nation from 1995–2016.
The dataset was taken from the Federal Bureau of Investigation (FBI) Crime Data Explorer.
FBI<-read.csv(url("https://s3-us-gov-west-1.amazonaws.com/cg-d3f0433b-a53e-4934-8b94-c678aa2cbaf3/arrests_national.csv"),row.names = 2,header = TRUE)
reactable(FBI,rownames = TRUE)
Since the id attribute is not used, it is removed from the dataset
FBI$id<-NULL
reactable(FBI,rownames = TRUE)
summary(FBI)
## population total_arrests homicide rape
## Min. :262803276 Min. :10662252 Min. :10231 Min. :16863
## 1st Qu.:282395819 1st Qu.:12586911 1st Qu.:11348 1st Qu.:21701
## Median :297952772 Median :13839754 Median :13331 Median :25032
## Mean :295487602 Mean :13418226 Mean :13710 Mean :25205
## 3rd Qu.:311023417 3rd Qu.:14180570 3rd Qu.:14134 3rd Qu.:28083
## Max. :323127513 Max. :15284300 Max. :21230 Max. :34650
## robbery aggravated_assault burglary larceny
## Min. : 94403 Min. :358860 Min. :207325 Min. :1050058
## 1st Qu.:105863 1st Qu.:400402 1st Qu.:288660 1st Qu.:1160498
## Median :108921 Median :442990 Median :295372 Median :1210490
## Mean :116045 Mean :445464 Mean :294936 Mean :1241126
## 3rd Qu.:126438 3rd Qu.:478265 3rd Qu.:304564 3rd Qu.:1279616
## Max. :171870 Max. :568480 Max. :386500 Max. :1530200
## motor_vehicle_theft arson violent_crime property_crime
## Min. : 64566 Min. : 8834 Min. :480360 Min. :1353283
## 1st Qu.: 78934 1st Qu.:11519 1st Qu.:539047 1st Qu.:1606177
## Median :139978 Median :15834 Median :597236 Median :1630406
## Mean :120921 Mean :14733 Mean :600415 Mean :1671715
## 3rd Qu.:148814 3rd Qu.:16759 3rd Qu.:626632 3rd Qu.:1677062
## Max. :191900 Max. :20000 Max. :796250 Max. :2128600
## other_assault forgery fraud embezzlement
## Min. :1078808 Min. : 55333 Min. :128531 Min. :15200
## 1st Qu.:1242966 1st Qu.: 72184 1st Qu.:173134 1st Qu.:16065
## Median :1293424 Median :107777 Median :281816 Median :17100
## Mean :1259624 Mean : 95762 Mean :273572 Mean :17620
## 3rd Qu.:1310566 3rd Qu.:115451 3rd Qu.:343650 3rd Qu.:18852
## Max. :1395800 Max. :122300 Max. :465000 Max. :22381
## stolen_property vandalism weapons prostitution
## Min. : 88576 Min. :191015 Min. :137779 Min. : 38306
## 1st Qu.: 95519 1st Qu.:241417 1st Qu.:157338 1st Qu.: 58676
## Median :121936 Median :275064 Median :167153 Median : 78640
## Mean :118191 Mean :265275 Mean :174857 Mean : 74418
## 3rd Qu.:128090 3rd Qu.:289934 3rd Qu.:190173 3rd Qu.: 87809
## Max. :166500 Max. :320900 Max. :243900 Max. :101600
## other_sex_offenses drug_abuse gambling against_family
## Min. : 51063 Min. :1476100 Min. : 3705 Min. : 88748
## 1st Qu.: 70076 1st Qu.:1533853 1st Qu.: 8900 1st Qu.:111938
## Median : 89082 Median :1576072 Median :10630 Median :127032
## Mean : 81231 Mean :1617127 Mean :10736 Mean :126231
## 3rd Qu.: 93149 3rd Qu.:1674540 3rd Qu.:11916 3rd Qu.:143487
## Max. :101900 Max. :1889810 Max. :21000 Max. :155800
## dui liquor_laws drunkenness disorderly_conduct
## Min. :1017808 Min. :234899 Min. :376433 Min. :369733
## 1st Qu.:1305198 1st Qu.:503684 1st Qu.:537818 1st Qu.:590412
## Median :1434117 Median :611335 Median :566726 Median :647346
## Mean :1364988 Mean :548852 Mean :573149 Mean :626903
## 3rd Qu.:1461434 3rd Qu.:635714 3rd Qu.:632832 3rd Qu.:693571
## Max. :1511300 Max. :683124 Max. :734800 Max. :842600
## vagrancy other suspicion curfew_loitering
## Min. :24851 Min. :3218880 Min. : 576 Min. : 34176
## 1st Qu.:27316 1st Qu.:3553687 1st Qu.: 1451 1st Qu.: 81406
## Median :29076 Median :3724251 Median : 3018 Median :139116
## Mean :29909 Mean :3668659 Mean : 3909 Mean :122666
## 3rd Qu.:33056 3rd Qu.:3832337 3rd Qu.: 5562 3rd Qu.:152130
## Max. :36471 Max. :4022068 Max. :12100 Max. :187800
Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
Sorting and grouping. Groups of “similar” objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.
The sample mean can be computed from the \(n\) measurements on each of the \(p\) variables, so that, in general, there will be \(p\) sample means:
\[ \overline{X}_{k} = \frac{1}{n}\sum_{j=1}^n x_{jk} \text{ where } k=1,2,3,...,p \]
The sample variance is a measure of spread defined for the \(n\) measurements on each of the \(p\) variables:
\[ s_k^2 = \frac{1}{n}\sum_{j=1}^n (x_{jk}-\overline{x}_k)^2 \text{ where } k=1,2,3,...,p \]
The sample covariance is a measure of linear association between the measurements on the \(i\)th and \(k\)th variables:
\[ s_{ik} = \frac{1}{n}\sum_{j=1}^n (x_{ji}-\overline{x}_i)(x_{jk}-\overline{x}_k) \text{ where }i=1,2,3,...,p \text{ and }k=1,2,3,...,p \]
Correlation coefficients are indicators of the strength of the linear relationship between two different variables, \(x\) and \(y\). The sample correlation coefficient for the \(i\)th and \(k\)th variables is defined as
\[ r_{ik}=\frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} \text{ where }i=1,2,3,...,p \text{ and }k=1,2,3,...,p \]
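As a minimal sketch, the quantities above can be computed in R on a small simulated data matrix (used here instead of the course data); note that R's built-in var() and cov() use the divisor \(n-1\), while the formulas above use \(n\).
# Sample means, variances, covariances and correlations for a simulated n x p matrix X
set.seed(1)
X <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
n <- nrow(X)
colMeans(X)              # sample means, one per variable
apply(X, 2, var)         # sample variances (divisor n - 1 in R)
cov(X)                   # sample covariance matrix S (divisor n - 1)
cor(X)                   # sample correlations r_ik = s_ik / sqrt(s_ii * s_kk)
cov(X) * (n - 1) / n     # covariance with divisor n, matching the formulas above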
For the graphical representation of the following techniques, see the application examples of:
Hepatitis Disease
FBI Arrest Data - Reported Number of Arrests by Crime
A scatter plot uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.
The chi-plot is a graphical representation of the measures of local dependence with an easy interpretation and with more information regarding the usual measures of correlation.
A box-plot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
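As a hedged illustration (assuming the cleaned hepatitis data frame built above, with numeric Bilirubin and the factor Class), a base-R box plot by group could look like:
# Box plot of Bilirubin by outcome class: the box shows Q1, the median and Q3,
# and the whiskers extend to the most extreme non-outlying values.
boxplot(Bilirubin ~ Class, data = hepatitis,
        col = c("tomato", "steelblue"),
        main = "Bilirubin by Class", ylab = "Bilirubin")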
Bubble plots are used when the data need a third dimension to provide richer information to viewers. A bubble plot is a relational chart designed to compare three variables.
Unlike other three-dimensional charts that process and represent data across three axes (usually x, y, and z), a bubble chart is represented on two axes (x and y), and the size of the bubble communicates the third, vital piece of information.
Suppose each data unit consists of non-negative observations on \(p\geq2\) variables. In two dimensions, we can construct circles of a fixed (reference) radius with p equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.
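A minimal sketch with base R's stars() function, assuming the numeric columns of the cleaned hepatitis data (only the first few records are drawn to keep the plot readable):
# Star plot: one star per patient, one ray per numeric variable (columns 15 to 19).
stars(hepatitis[1:16, 15:19],
      main = "Star plot of the first 16 hepatitis records")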
People react to faces. Chernoff suggested representing \(p\)-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the \(p\) variables.
As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved.
A 3D scatterplot is a three-dimensional graph that is useful for investigating desirable response values and operating conditions.
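A short sketch with the scatterplot3d package (also loaded later in this report), again assuming the cleaned hepatitis data:
library(scatterplot3d)
# 3D scatter plot of three liver-related measurements, coloured by outcome class.
scatterplot3d(hepatitis$Bilirubin, hepatitis$SGOT, hepatitis$Albumin,
              color = as.numeric(hepatitis$Class) + 1,
              xlab = "Bilirubin", ylab = "SGOT", zlab = "Albumin",
              main = "Bilirubin, SGOT and Albumin by Class")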
A conditioning plot is a scatter plot of two variables conditioned on a third variable, called the conditioning variable. The conditioning variable can be either continuous or categorical. For a continuous conditioning variable, subsets are created by dividing its range into smaller intervals; for a categorical one, the subsets correspond to its categories.
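A minimal conditioning-plot sketch with base R's coplot(), using the outcome class of the hepatitis data as the (categorical) conditioning variable:
# Conditioning plot: SGOT against Bilirubin, one panel per level of Class,
# with a smooth trend line in each panel.
coplot(SGOT ~ Bilirubin | Class, data = hepatitis,
       panel = panel.smooth)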
library(dplyr)
library(Hmisc)          # needed for rcorr() below
library(scatterplot3d)
library(corrplot)
library(ggplot2)
library(GGally)
datatable(as.matrix(sapply(hepatitis[,15:19],function(x) mean(x, na.rm=TRUE))))
datatable(var(hepatitis[,15:19],use = "complete.obs"))
datatable(cov(hepatitis[,15:19],use = "complete.obs"))
c=cor(hepatitis[,15:19])
y=as.data.frame(c)
y[y==1]<-" "
y <- mutate_all(y, function(x) as.numeric(as.character(x)))
reactable(as.data.frame.array((y)),
defaultColDef = colDef(
style = highlight_min_max(as.data.frame.array((y)))))
corrplot(cor(hepatitis[,15:19],use = "complete.obs"),method="number")
corrplot(cor(hepatitis[,15:19],use = "complete.obs"),method="circle")
hepatitis.rcorr = rcorr(as.matrix(hepatitis[,15:19]))
hepatitis.p=hepatitis.rcorr$P
reactable(as.data.frame.array(hepatitis.p),
defaultColDef = colDef(
style = highlight_min_max(as.data.frame.array(hepatitis.p))))
Since the correlation coefficients are close to 0, the numeric variables are at most weakly correlated with each other.
For the pairs whose Pearson correlation test p-values are below 0.05, the null hypothesis of zero correlation is rejected.
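For a single pair of variables, the same test can be reproduced with cor.test(); a hedged sketch for Bilirubin and SGOT:
# Pearson correlation test for one pair of variables.
# H0: the population correlation between Bilirubin and SGOT is 0.
cor.test(hepatitis$Bilirubin, hepatitis$SGOT, method = "pearson")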
Based on the correlation values, we explore the relationship between the SGOT and Bilirubin variables graphically.
ggplot(hepatitis, aes(x = Bilirubin,
y = SGOT,
color=Class)) +
geom_point() +
labs(title = "Bilirubin vs SGOT by Class")
ggplot(hepatitis, aes(x = Bilirubin,
y = SGOT,
color=Anorexia,
shape=Class)) +
geom_point(
) +
labs(title = "Bilirubin vs SGOT by Class and Anorexia")
ggplot(hepatitis, aes(x = Bilirubin,
y = SGOT,
color=Anorexia,
shape=Antivirals)) +
geom_point(
) +
labs(title = "Bilirubin vs SGOT by Anorexia and Antivirals")
ggplot(hepatitis, aes(x = Bilirubin,
y = SGOT,
color=Fatigue,
size=as.numeric(Age))) +
geom_point(
) +
labs(title = "Bilirubin vs SGOT by Fatigue and Age")
ggplot(hepatitis, aes(x = Bilirubin,
y = SGOT,
color=Class,
size=as.numeric(Age))) +
geom_point(
) +
labs(title = "Bilirubin vs SGOT by Class and Age")
ggscatmat(hepatitis, columns=16:19, color="Class")
library(dplyr)
library(MVA)
library(aplpack)
library(scatterplot3d)
library(corrplot)
library(ggplot2)
datatable(as.matrix(sapply(FBI,function(x) mean(x, na.rm=TRUE))))
reactable(var(FBI,use = "complete.obs"))
reactable(cov(FBI,use = "complete.obs"))
c=cor(FBI)
y=as.data.frame(c)
y[y==1]<-" "
y <- mutate_all(y, function(x) as.numeric(as.character(x)))
reactable(as.data.frame.array((y)),
defaultColDef = colDef(
style = highlight_min_max(as.data.frame.array((y)))))
corrplot(cor(FBI,use = "complete.obs"),method="circle")
FBI.rcorr = rcorr(as.matrix(FBI))
FBI.p=FBI.rcorr$P
reactable(as.data.frame.array(FBI.p),
defaultColDef = colDef(
style = highlight_min_max(as.data.frame.array(FBI.p))))
As can be seen in the correlation plot, almost all the variables are strongly correlated with each other.
The values highlighted in green on the correlation matrix represent the pairs of variables that are most strongly correlated with each other.
The values highlighted in green on the correlation test table are the largest p-values; for these pairs the null hypothesis of zero correlation is not rejected.
We consider the attributes in the correlation matrix that are most strongly positively correlated with each other (values greater than 0.95); some of these attributes are used to produce the plots below.
ggplot(FBI, aes(x=violent_crime, y=aggravated_assault)) + geom_point()+labs(title = "Violent Crime vs Aggravated Assault")
ggplot(FBI, aes(x=violent_crime, y=aggravated_assault,label=rownames(FBI))) + geom_text()+labs(title = "Violent Crime vs Aggravated Assault using Row Names")
ggplot(FBI, aes(x=violent_crime, y=homicide)) + geom_point()+labs(title = "Violent Crime vs Homicide")
ggplot(FBI, aes(x=violent_crime, y=homicide,label=rownames(FBI))) + geom_text()+labs(title = "Violent Crime vs Homicide using Row Names")
ggplot(FBI, aes(x=violent_crime, y=aggravated_assault,size=homicide)) + geom_point(alpha=0.5)+scale_size(range=c(.1,15))+labs(title = "Violent Crime vs Aggravated Assault and Homicide")
ggplot(FBI, aes(x=violent_crime, y=aggravated_assault,size=fraud)) + geom_point(alpha=0.5)+scale_size(range=c(.1,15))+labs(title = "Violent Crime vs Aggravated Assault and Fraud")
df<-data.frame(FBI$aggravated_assault,FBI$violent_crime,FBI$homicide,FBI$stolen_property,FBI$fraud,FBI$arson,FBI$prostitution,FBI$other_sex_offenses,FBI$drunkenness,FBI$dui,FBI$liquor_laws,FBI$drug_abuse,FBI$curfew_loitering,FBI$embezzlement,FBI$vagrancy)
faces(df, main="United States FBI Arrest Data",face.type=0, print.info=TRUE,labels = rownames(FBI))
## effect of variables:
## modified item Var
## "height of face " "FBI.aggravated_assault"
## "width of face " "FBI.violent_crime"
## "structure of face" "FBI.homicide"
## "height of mouth " "FBI.stolen_property"
## "width of mouth " "FBI.fraud"
## "smiling " "FBI.arson"
## "height of eyes " "FBI.prostitution"
## "width of eyes " "FBI.other_sex_offenses"
## "height of hair " "FBI.drunkenness"
## "width of hair " "FBI.dui"
## "style of hair " "FBI.liquor_laws"
## "height of nose " "FBI.drug_abuse"
## "width of nose " "FBI.curfew_loitering"
## "width of ear " "FBI.embezzlement"
## "height of ear " "FBI.vagrancy"
Recall the univariate theory for determining whether a specific value \(\mu_0\) is a plausible value for the population mean \(\mu\). From the point of view of hypothesis testing, this problem can be formulated as:
\(H_0:\mu=\mu_0\)
\(H_1:\mu\not= \mu_0\)
Where \(H_0\) is the null hypothesis and \(H_1\) the alternative hypothesis
Let \(X_1,\dots,X_n\) be a random sample from a normal population; the appropriate test statistic has a Student's \(t\) distribution with \(n-1\) degrees of freedom:
\[t=\frac{(\overline{X}-\mu_0)}{\frac{s}{\sqrt{n}}}\] where:
\[ \overline{X}=\frac{1}{n}\sum_{j=1}^{n} X_j \text{ and } s^2=\frac{1}{n-1}\sum_{j=1}^n (X_j-\overline{X})^2 \]
Rejecting \(H_0\) when \(\mid t\mid\) is large is equivalent to rejecting \(H_0\) if \(t^2\) is large.
Let \(t^2\) be the squared distance from the sample mean \(\overline{X}\) to the test value \(\mu_0\):
\[ t^2=\frac{(\overline{X}-\mu_0)^2}{\frac{s^2}{n}}=n(\overline{X}-\mu_0)^2(s^2)^{-1} \]
Given the observed values of \(\overline{X}\) and \(s^2\), the test becomes: reject \(H_0\) in favor of \(H_1\) at significance level \(\alpha\) if
\[ n(\overline{X}-\mu_0)^2(s^2)^{-1}>t^2_{n-1}(\alpha/2) \]
where \(t_{n-1}(\alpha/2)\) denotes the upper 100\((\alpha/2)\)th percentile of the \(t\)-distribution with \(n-1\) degrees of freedom.
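As a hedged numerical illustration of the univariate test, using the Bilirubin column of the cleaned hepatitis data and an arbitrary test value \(\mu_0 = 1\) chosen only for illustration:
# Univariate t-test of H0: mu = mu0 for a single variable.
x   <- hepatitis$Bilirubin
mu0 <- 1                          # arbitrary test value, for illustration only
n   <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
t_stat^2                          # squared distance n*(xbar - mu0)^2 / s^2
qt(1 - 0.05 / 2, df = n - 1)^2    # squared critical value t_{n-1}(alpha/2)^2
t.test(x, mu = mu0)               # the same test with the built-in function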
A generalization of the squared distance between the sample mean \(\overline{X}\) and the test value \(\mu_0\) is:
\[ T^2=\left(\overline{X}-\mu_0\right)'\left(\frac{1}{n}S\right)^{-1}\left(\overline{X}-\mu_0\right)=n\left(\overline{X}-\mu_0\right)'S^{-1}\left(\overline{X}-\mu_0\right) \]
where:
\[ \overline{X}_{(p\times 1)} = \frac{1}{n}\sum_{j=1}^n X_j, S_{(p\times p)}=\frac{1}{n-1}\sum_{j=1}^n\left(X_j-\overline{X}\right)\left(X_j-\overline{X}\right)' \text{ and } \mu_0=\begin{bmatrix} \mu_{10} \\ \mu_{20} \\ \vdots \\ \mu_{p0} \end{bmatrix} \]
The hypothesis \(H_0:\mu=\mu_0\) is rejected if the observed statistical distance \(T^2\) is too large (i.e. if \(\overline{x}\) is too far from \(\mu_0\) )
It turns out that special tables of \(T^2\) percentage points are not required for formal tests of hypotheses. This is because
\[T^2 \text{ is distributed as }\frac{(n-1)p}{n-p}F_{p,n-p}\] where \(F_{p,n-p}\) denotes a random variable with an \(F\)-distribution with \(p\) and \(n-p\) degrees of freedom.
Let \(\mathbf{X}_1…,\mathbf{X}_n\) be a random sample from an \(N_p(\mu,\Sigma)\) population. Then with \(\mathbf{\overline{X}}=\frac{1}{n}\sum_{j=1}^n\mathbf{X}_j\) and \(S=\frac{1}{n-1}\sum_{j=1}^n(X_j-\overline{X})(X_j-\overline{X})'\),
\[\alpha=P\left[T^2>\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]\]
\[ =P\left[n\left(\overline{X}-\mu\right)'S^{-1}\left(\overline{X}-\mu\right)>\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right] \]
whatever the true \(\mu\) and \(\Sigma\). Here \(F_{p,n-p}(\alpha)\) is the upper (100\(\alpha\))th percentile of the \(F_{p,n-p}\) distribution.
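The statistic can also be computed directly from its definition; a minimal sketch for the five numeric hepatitis variables, with an arbitrary hypothesised mean vector \(\mu_0\) (here the column medians, used only for illustration):
# One-sample Hotelling T^2 computed from the definition
# T^2 = n * (xbar - mu0)' S^{-1} (xbar - mu0).
X    <- as.matrix(hepatitis[, 15:19])     # Bilirubin, ALK Phosphate, SGOT, Albumin, Protime
n    <- nrow(X)
p    <- ncol(X)
xbar <- colMeans(X)
S    <- cov(X)
mu0  <- apply(X, 2, median)               # arbitrary test value, for illustration only
T2 <- n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0)
# Compare with the scaled F critical value at alpha = 0.05:
crit <- (n - 1) * p / (n - p) * qf(0.95, df1 = p, df2 = n - p)
c(T2 = as.numeric(T2), critical = crit)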
library(Hotelling)
t2testsparr <- hotelling.test(hepatitis$Bilirubin + hepatitis$`ALK Phosphate` + hepatitis$SGOT +
hepatitis$Albumin + hepatitis$Protime ~ hepatitis$Class)
t2testsparr
## Test stat: 71.979
## Numerator df: 5
## Denominator df: 148
## P-value: 3.243e-11
The Hotelling \(T^2\) statistic is statistically significant (i.e. there is evidence of a mean difference between the patients who lived and those who died, considering the 5 numeric attributes).
t2testsparr <- hotelling.test(hepatitis$Bilirubin + hepatitis$`ALK Phosphate` + hepatitis$SGOT +
hepatitis$Albumin + hepatitis$Protime ~ hepatitis$Sex)
t2testsparr
## Test stat: 2.5208
## Numerator df: 5
## Denominator df: 148
## P-value: 0.7827
The Hotelling \(T^2\) statistic is not statistically significant (i.e. there is no evidence of a mean difference between male and female patients, considering the 5 numeric attributes).
The Wishart distribution is a family of distributions for symmetric positive definite matrices. Let \(\mathbf{X_1,X_2,…,X_n}\) be independent \(N_p(\mathbf{0},\Sigma)\) and form a \(p × n\) data matrix \(X = [X_1,...,X_n]\). The distribution of a \(p × p\) random matrix \(\mathbf{M = XX}′=\sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i'\) is said to have the Wishart distribution.
The random matrix \(\mathbf{M}(p×p) = \sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i'\) has the Wishart distribution with n degrees of freedom and covariance matrix \(\Sigma\) and is denoted by \(\mathbf{M} ∼ W_p(n, \Sigma)\). For n ≥ p, the probability density function of M is
\[ f(\mathbf{M}) = \frac{1}{2^{np/2}\Gamma_p \left(\frac{n}{2}\right) \left| \Sigma \right|^{n/2}} \left| \mathbf{M} \right|^{(n-p-1)/2}\exp \left( -\frac{1}{2}\text{trace}(\Sigma^{-1}\mathbf{M}) \right) \]
with respect to Lebesgue measure on the cone of symmetric positive definite matrices. Here, \(\Gamma_p(\alpha)\) is the multivariate gamma function.
The precise form of the density is rarely used. Two exceptions are:
In Bayesian computation, the Wishart distribution is often used as a conjugate prior for the inverse of normal covariance matrix.
When symmetric positive definite matrices are the random elements of interest in diffusion tensor study.
The Wishart distribution is a multivariate extension of the \(\chi^2\) distribution. In particular, if \(\mathbf{M} \sim W_1(n, \sigma^2)\), then \(\mathbf{M}/\sigma^2 \sim \chi^2_n\). For the special case \(\Sigma = \mathbb{I}\), \(W_p(n, \mathbb{I})\) is called the standard Wishart distribution.
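A quick sanity-check sketch with base R's rWishart(), whose mean is \(n\Sigma\); the covariance matrix and degrees of freedom below are arbitrary illustrative choices.
# Simulate from a Wishart distribution and check that the empirical
# mean of M is close to n * Sigma.
set.seed(123)
Sigma <- matrix(c(1, 0.5,
                  0.5, 2), nrow = 2)          # illustrative covariance matrix
df    <- 10                                   # degrees of freedom (n in the notation above)
M <- rWishart(5000, df = df, Sigma = Sigma)   # a 2 x 2 x 5000 array of draws
apply(M, c(1, 2), mean)                       # empirical mean of M
df * Sigma                                    # theoretical mean n * Sigma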
PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and analyzing the data becomes much easier and faster for machine learning algorithms when there are no extraneous variables to process.
The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA, is that the latter is quite sensitive regarding the variances of the initial variables.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.
Once the standardization is done, all the variables will be transformed to the same scale.
The covariance matrix is a \(p\times p\) symmetric matrix (where \(p\) is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables.
The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them.
Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
Organizing information in principal components will allow you to reduce dimensionality without losing much information, and this by discarding the components with low information and considering the remaining components as your new variables.
An important thing to realize here is that the principal components are less interpretable and don’t have any real meaning since they are constructed as linear combinations of the initial variables.
Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most information of the data. The relationship between variance and information here, is that, the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more the information it has.
What we do is choose whether to keep all of these components or discard those of lesser significance (those with low eigenvalues), and form with the remaining eigenvectors a matrix that we call the feature vector.
The aim is to use the feature vector formed using the eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Components Analysis). This can be done by multiplying the transpose of the original data set by the transpose of the feature vector.
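These steps can be carried out by hand and compared with prcomp(); a minimal sketch on the five numeric hepatitis variables (the signs of individual components may differ between the two approaches):
# PCA "by hand": standardise, build the correlation matrix,
# take its eigendecomposition, and project the data onto the eigenvectors.
Z   <- scale(hepatitis[, 15:19])      # 1. standardise the numeric variables
R   <- cov(Z)                         # 2. covariance of standardised data = correlation matrix
eig <- eigen(R)                       # 3. eigenvalues and eigenvectors
eig$values / sum(eig$values)          # proportion of variance per component
W      <- eig$vectors[, 1:2]          # 4. feature vector: keep the first two components
scores <- Z %*% W                     # 5. project the data onto the new axes
# Compare with prcomp(): the standard deviations should match sqrt(eigenvalues).
heppca <- prcomp(hepatitis[, 15:19], scale = TRUE)
rbind(by_hand = sqrt(eig$values), prcomp = heppca$sdev)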
library(Hmisc)
library(psych)
library(bcv)
cor(hepatitis[,15:19])
## Bilirubin ALK Phosphate SGOT Albumin Protime
## Bilirubin 1.0000000 0.1325704 0.2352884 -0.3701577 -0.2220303
## ALK Phosphate 0.1325704 1.0000000 0.1831043 -0.3358635 -0.1856398
## SGOT 0.2352884 0.1831043 1.0000000 -0.1065261 -0.1375747
## Albumin -0.3701577 -0.3358635 -0.1065261 1.0000000 0.2964787
## Protime -0.2220303 -0.1856398 -0.1375747 0.2964787 1.0000000
heppca<- prcomp(hepatitis[,15:19],scale=TRUE)
summary(heppca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.3797 0.9633 0.9317 0.8820 0.7228
## Proportion of Variance 0.3807 0.1856 0.1736 0.1556 0.1045
## Cumulative Proportion 0.3807 0.5663 0.7399 0.8955 1.0000
plot(heppca, xlab="Components")
plot(heppca, type="lines")
numcomp<-cv.svd.gabriel(cor(hepatitis[,15:19]), krow = 2, kcol = 2, maxrank = 2)
numcomp
##
## Call:
## cv.svd.gabriel(x = cor(hepatitis[, 15:19]), krow = 2, kcol = 2, maxrank = 2)
##
## Rank MSEP SE
## ----------------------
## 0 0.2412 0.07661 *+
## 1 0.3435 0.16317
## 2 4.0134 3.33413
From the above result, 1 component is selected
numcomp1<-cv.svd.wold(cor(hepatitis[,15:19]), k=5, maxrank=5)
numcomp1
##
## Call:
## cv.svd.wold(x = cor(hepatitis[, 15:19]), k = 5, maxrank = 5)
##
## Rank MSEP SE
## ----------------------
## 0 0.2447 0.08581 *+
## 1 0.2628 0.04207
## 2 0.7866 0.19539
## 3 0.7994 0.26247
## 4 0.5368 0.10790
## 5 0.3441 0.09534
From the above result, 1 component is selected
biplot(heppca,cex=c(0.4,0.5),expand = 1)
hep.cpr <- heppca$rotation %*% diag(heppca$sdev)
barplot(t(hep.cpr), beside = TRUE, ylim = c(-1, 1))
par(mfrow=c(1,2))
plot(heppca$x[,1],hepatitis$Bilirubin,xlab="PC1")
plot(heppca$x[,2],hepatitis$Bilirubin,xlab="PC2")
modlin<- lm(hepatitis$Bilirubin~heppca$x[,1]+heppca$x[,2])
summary(modlin)
##
## Call:
## lm(formula = hepatitis$Bilirubin ~ heppca$x[, 1] + heppca$x[,
## 2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4345 -0.4252 -0.0635 0.2554 4.2544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.41364 0.07066 20.007 < 2e-16 ***
## heppca$x[, 1] 0.56990 0.05138 11.092 < 2e-16 ***
## heppca$x[, 2] -0.23219 0.07359 -3.155 0.00194 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8768 on 151 degrees of freedom
## Multiple R-squared: 0.4683, Adjusted R-squared: 0.4612
## F-statistic: 66.49 on 2 and 151 DF, p-value: < 2.2e-16
Since \(R^2 = 0.4683\), the first two principal components explain less than half of the variability in Bilirubin; although the coefficients are significant, the fit of the linear regression model is poor.
For simplicity, we consider the attributes in the correlation matrix that are most strongly positively correlated with each other (values greater than 0.95):
library(Hmisc)
library(psych)
library(bcv)
df<-data.frame(FBI$aggravated_assault,FBI$violent_crime,FBI$homicide,FBI$stolen_property,FBI$fraud,FBI$arson,FBI$prostitution,FBI$other_sex_offenses,FBI$drunkenness,FBI$dui,FBI$liquor_laws,FBI$drug_abuse,FBI$curfew_loitering,FBI$embezzlement,FBI$vagrancy,row.names = rownames(FBI))
c=cor(df)
y=as.data.frame(c)
y[y==1]<-" "
y <- mutate_all(y, function(x) as.numeric(as.character(x)))
reactable(as.data.frame.array((y)),
defaultColDef = colDef(
style = highlight_min_max(as.data.frame.array((y)))))
FBI.rcorr = rcorr(as.matrix(df))
FBI.p=FBI.rcorr$P
reactable(as.data.frame.array(FBI.p),
defaultColDef = colDef(
style = highlight_min_max(as.data.frame.array(FBI.p))))
cortest.bartlett(df)
## R was not square, finding R from data
## $chisq
## [1] 689.2364
##
## $p.value
## [1] 1.070471e-86
##
## $df
## [1] 105
Since the p-value is less than 0.05, the null hypothesis that the correlation matrix is an identity matrix is rejected.
The Kaiser-Meyer-Olkin (KMO) Test is a test that is used to decide whether our samples are suitable for conducting factor analysis. Factor analysis in statistics is about identifying underlying factors or causes that can be used to represent the relationship between two or more variables.
KMO(df)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = df)
## Overall MSA = 0.73
## MSA for each item =
## FBI.aggravated_assault FBI.violent_crime FBI.homicide
## 0.70 0.65 0.78
## FBI.stolen_property FBI.fraud FBI.arson
## 0.83 0.74 0.87
## FBI.prostitution FBI.other_sex_offenses FBI.drunkenness
## 0.76 0.75 0.72
## FBI.dui FBI.liquor_laws FBI.drug_abuse
## 0.73 0.75 0.50
## FBI.curfew_loitering FBI.embezzlement FBI.vagrancy
## 0.72 0.67 0.46
Since the overall KMO value is 0.73, which is above 0.5, the sampling is adequate and the data are well suited for PCA.
reactable(var(df,use = "complete.obs"))
FBIPCA<- prcomp(df,scale=T)
summary(FBIPCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.2888 1.6792 0.71296 0.63426 0.44638 0.30214 0.2192
## Proportion of Variance 0.7211 0.1880 0.03389 0.02682 0.01328 0.00609 0.0032
## Cumulative Proportion 0.7211 0.9091 0.94296 0.96978 0.98306 0.98915 0.9923
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.19477 0.16915 0.15033 0.11660 0.08758 0.05124 0.03620
## Proportion of Variance 0.00253 0.00191 0.00151 0.00091 0.00051 0.00018 0.00009
## Cumulative Proportion 0.99488 0.99679 0.99829 0.99920 0.99971 0.99989 0.99997
## PC15
## Standard deviation 0.01957
## Proportion of Variance 0.00003
## Cumulative Proportion 1.00000
plot(FBIPCA)
plot(FBIPCA,type="lines")
biplot(FBIPCA,cex=c(0.4,0.5),expand = 1)
The year 2003 lies near the average: its values on the variables with the most weight in each component are closest to their respective averages.
From 2013 onward, the years show the lowest values on these variables.
It can be said that arrests for these crimes have decreased over the years.
cv.svd.gabriel(x = cov(df), krow = 15, kcol = 15, maxrank = 14)
##
## Call:
## cv.svd.gabriel(x = cov(df), krow = 15, kcol = 15, maxrank = 14)
##
## Rank MSEP SE
## ---------------------------
## 0 2.080e+19 4.025e+18
## 1 1.790e+18 8.872e+17
## 2 9.589e+17 4.244e+17
## 3 6.019e+17 3.040e+17
## 4 9.110e+17 3.515e+17
## 5 5.098e+17 2.348e+17
## 6 1.555e+17 4.672e+16
## 7 1.779e+17 5.025e+16
## 8 3.242e+17 1.244e+17
## 9 3.956e+17 1.848e+17
## 10 6.420e+16 1.890e+16
## 11 2.020e+16 5.785e+15 *+
## 12 2.465e+16 6.877e+15
## 13 1.153e+17 3.992e+16
## 14 3.255e+17 1.631e+17
From the above result, 12 components are selected
cv.svd.wold(cov(df), k=15, maxrank=14)
##
## Call:
## cv.svd.wold(x = cov(df), k = 15, maxrank = 14)
##
## Rank MSEP SE
## ---------------------------
## 0 2.080e+19 3.320e+18
## 1 1.805e+18 9.038e+17
## 2 8.569e+17 3.359e+17 *+
## 3 4.197e+18 1.372e+18
## 4 8.544e+18 1.738e+18
## 5 1.035e+19 1.543e+18
## 6 1.193e+19 1.989e+18
## 7 1.229e+19 1.982e+18
## 8 1.242e+19 1.994e+18
## 9 1.249e+19 1.993e+18
## 10 1.250e+19 1.992e+18
## 11 1.251e+19 1.992e+18
## 12 1.251e+19 1.992e+18
## 13 1.251e+19 1.992e+18
## 14 1.251e+19 1.992e+18
From the above result, 3 components are selected
Centers for Disease Control and Prevention. "What Is Viral Hepatitis?" www.cdc.gov, July 28, 2020. https://www.cdc.gov/hepatitis/abc/index.htm
Versus Arthritis. "Steroids | Side-Effects, Uses, Time to Work." www.versusarthritis.org. Accessed August 21, 2022. https://www.versusarthritis.org/about-arthritis/treatments/drugs/steroids/
Cleveland Clinic. "Antivirals: Antiviral Medication, What They Treat & How They Work." my.clevelandclinic.org. Accessed August 21, 2022. https://my.clevelandclinic.org/health/drugs/21531-antivirals
Luo, Elaine K. "Malaise: Causes, Diagnosis and Treatments." Healthline, July 4, 2019. https://www.healthline.com/health/malaise
Bhandari, Smitha. "Anorexia Nervosa: Symptoms, Causes, Diagnosis, Treatment." WebMD, July 20, 2020. https://www.webmd.com/mental-health/eating-disorders/anorexia-nervosa/mental-health-anorexia-nervosa
Cleveland Clinic. "Spleen: Spleen Function, Enlarged Spleen, What Does the Spleen Do." my.clevelandclinic.org. Accessed August 21, 2022. https://my.clevelandclinic.org/health/body/21567-spleen
Cleveland Clinic. "Ascites: Fluid Buildup, Causes, Symptoms & Treatment." my.clevelandclinic.org. Accessed August 21, 2022. https://my.clevelandclinic.org/health/diseases/14792-ascites
Mount Sinai. "Spider Angioma Information." www.mountsinai.org. Accessed August 21, 2022. https://www.mountsinai.org/health-library/diseases-conditions/spider-angioma
Cleveland Clinic. "Bilirubin Test: Test Details & Results." my.clevelandclinic.org. Accessed August 21, 2022. https://my.clevelandclinic.org/health/diagnostics/17845-bilirubin
Sethi, Saurabh. "What Are Esophageal Varices? Types, Treatments, and More." Medical News Today, May 29, 2020. https://www.medicalnewstoday.com/articles/esophageal-varices
Robinson, Jennifer. "Alkaline Phosphatase Test (ALP): High vs. Low Levels." WebMD, May 20, 2021. https://www.webmd.com/digestive-disorders/alkaline_phosphatase_test
National Cancer Institute. "NCI Dictionary of Cancer Terms: SGOT." www.cancer.gov. Accessed August 21, 2022. https://www.cancer.gov/publications/dictionaries/cancer-terms/def/sgot
Mount Sinai. "Albumin - Blood (Serum) Test Information." www.mountsinai.org. Accessed August 21, 2022. https://www.mountsinai.org/health-library/tests/albumin-blood-serum-test
Mayo Clinic. "Prothrombin Time Test." www.mayoclinic.org, December 8, 2020. https://www.mayoclinic.org/tests-procedures/prothrombin-time/about/pac-20384661
Federal Bureau of Investigation, ed. "Crime Data Explorer." crime-data-explorer.fr.cloud.gov. Accessed August 21, 2022. https://crime-data-explorer.fr.cloud.gov/pages/downloads#datasets
Gong, Gail. "Hepatitis." UC Irvine Machine Learning Repository, November 1, 1988. https://archive-beta.ics.uci.edu/ml/datasets/hepatitis
Federal Bureau of Investigation, ed. "FBI National Arrests." Accessed August 21, 2022. https://s3-us-gov-west-1.amazonaws.com/cg-d3f0433b-a53e-4934-8b94-c678aa2cbaf3/arrests_national.csv
Investopedia. "Correlation Coefficients: Positive, Negative, & Zero." www.investopedia.com, May 31, 2021. https://www.investopedia.com/ask/answers/032515/what-does-it-mean-if-correlation-coefficient-positive-negative-or-zero.asp
Yi, Mike. "Scatter Plots | A Complete Guide to Scatter Plots." Chartio. chartio.com. Accessed August 21, 2022. https://chartio.com/learn/charts/what-is-a-scatter-plot/
TIBCO Software. "What Is a Bubble Chart?" www.tibco.com. Accessed August 21, 2022. https://www.tibco.com/reference-center/what-is-a-bubble-chart
Galarnyk, Michael. "Understanding Boxplots." Towards Data Science, July 6, 2020. https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
Minitab. "Interpret the Key Results for 3D Scatterplot." support.minitab.com. Accessed August 21, 2022. https://support.minitab.com/en-us/minitab/21/help-and-how-to/graphs/3d-scatterplot/interpret-the-results/key-results/
GeeksforGeeks. "Conditioning Plot." www.geeksforgeeks.org, March 7, 2021. https://www.geeksforgeeks.org/conditioning-plot/
Jaadi, Zakaria. "A Step-by-Step Explanation of Principal Component Analysis (PCA)." Built In, April 1, 2021. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
All Things Statistics. "KMO Test (for Measuring Sampling Adequacy)." allthingsstatistics.com, July 17, 2021. https://allthingsstatistics.com/inferential-statistics/kmo-test/
Wei, Taiyun, and Viliam Simko. "An Introduction to Corrplot Package." cran.r-project.org, November 18, 2021. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
Kabacoff, Rob. "Data Visualization with R." rkabacoff.github.io. Accessed August 21, 2022. https://rkabacoff.github.io/datavis/Multivariate.html
Marchi, Vitor A. A., Francisco A. R. Rojas, and Francisco Louzada. "The Chi-Plot and Its Asymptotic Confidence Interval for Analyzing Bivariate Dependence: An Application to the Average Intelligence and Atheism Rates across Nations Data." Journal of Data Science 10, no. 4 (2012): 711-722. https://doi.org/10.6339/JDS.2012.10(4).1094