About Information

  • Author: Jose Carlos Molano de Oro

  • University: Pontificia Universidad Javeriana

  • Course: Multivariate Analysis

  • Semester: 2022-3

  • Professor: Marisol Garcia Peña

  • Author Email: jose_molano@javeriana.edu.co

  • Professor Email: marisolgarcia@javeriana.edu.co

Hepatitis Disease Dataset

About the Dataset

Contains medical data on 155 people with acute or chronic hepatitis. The data give the values of 20 variables for each person that, taken together, could predict whether a patient will die or recover from the disease (Diaconis & Efron, 1983).

Definitions

Hepatitis

Hepatitis means inflammation of the liver. The liver is a vital organ that processes nutrients, filters the blood, and fights infections. When the liver is inflamed or damaged, its function can be affected. Heavy alcohol use, toxins, some medications, and certain medical conditions can cause hepatitis. However, hepatitis is often caused by a virus.

Steroids

Steroids are a man-made version of chemicals, known as hormones, that are made naturally in the human body. Steroids are designed to act like these hormones to reduce inflammation.

Antivirals

Antivirals are medications that help your body fight off certain viruses that can cause disease. Antiviral drugs are also preventive. They can protect you from getting viral infections or spreading a virus to others.

Fatigue

Fatigue is a feeling of constant tiredness or weakness and can be physical, mental or a combination of both. It can affect anyone, and most adults will experience fatigue at some point in their life.

Malaise

Malaise is described as any of the following:

  • a feeling of overall weakness

  • a feeling of discomfort

  • a feeling like you have an illness

  • simply not feeling well

Anorexia

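In this dataset, anorexia refers to the clinical symptom of loss of appetite, which is common in hepatitis. It should not be confused with anorexia nervosa, a potentially life-threatening eating disorder that is characterized by self-starvation and excessive weight loss.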

Spleen

The spleen is a small organ inside your left rib cage, just above the stomach. It’s part of the lymphatic system (which is part of the immune system). The spleen stores and filters blood and makes white blood cells that protect you from infection.

Spider Angioma

Spider angioma is an abnormal collection of blood vessels near the surface of the skin.

Ascites

Ascites is a buildup of fluid in your abdomen. It often occurs as a result of cirrhosis, a liver disease.

Varices

Varices are large or swollen blood vessels, which can be located around the esophagus. The most common cause of esophageal varices is scarring of the liver.

Bilirubin

Bilirubin is made during the normal process of breaking down red blood cells. It is a yellowish substance found in bile, a fluid in your liver. This fluid helps digest food.

ALK Phosphate

Alkaline phosphatase (ALK Phosphate in the dataset) is an enzyme found in your body. Enzymes are proteins that help chemical reactions happen. For instance, they can break big molecules down into smaller parts, or they can help smaller molecules join together to form bigger structures.

SGOT

SGOT is an enzyme found in the liver, heart, and other tissues. A high level of SGOT released into the blood may be a sign of liver or heart damage, cancer, or other diseases. It is also called aspartate transaminase and serum glutamic-oxaloacetic transaminase.

Albumin

Albumin is a protein made by the liver. A serum albumin test measures the amount of this protein in the clear liquid portion of the blood.

Prothrombin

Prothrombin is a protein produced by your liver. It is one of many factors in your blood that help it to clot appropriately.

Dataset Description

  1. Sources

    • Unknown

    • Donor: G.Gong (Carnegie-Mellon University)

      • via Bojan Cestnik

      • Jozef Stefan Institute

      • Jamova 39

      • 61000 Ljubljana

      • Yugoslavia (tel.: (38)(+61) 214-399 ext.287)

    • Date: November, 1988

  2. Past Usage

    • Diaconis,P. & Efron,B. (1983). Computer-Intensive Methods in Statistics. Scientific American, Volume 248. – Gail Gong reported an 80% classification accuracy

    • Cestnik,G., Kononenko,I, & Bratko,I. (1987). Assistant-86: A Knowledge-Elicitation Tool for Sophisticated Users. In I.Bratko & N.Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press. – Assistant-86: 83% accuracy

  3. Number of Instances: 155

  4. Number of Attributes: 20 (including the class attribute)

  5. Attribute information

    • Class: DIE, LIVE

    • AGE: 10, 20, 30, 40, 50, 60, 70, 80

    • SEX: male, female

    • STEROID: no, yes

    • ANTIVIRALS: no, yes

    • FATIGUE: no, yes

    • MALAISE: no, yes

    • ANOREXIA: no, yes

    • LIVER BIG: no, yes

    • LIVER FIRM: no, yes

    • SPLEEN PALPABLE: no, yes

    • SPIDERS: no, yes

    • ASCITES: no, yes

    • VARICES: no, yes

    • BILIRUBIN: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00

    • ALK PHOSPHATE: 33, 80, 120, 160, 200, 250

    • SGOT: 13, 100, 200, 300, 400, 500

    • ALBUMIN: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0

    • PROTIME: 10, 20, 30, 40, 50, 60, 70, 80, 90

    • HISTOLOGY: no, yes

  6. Missing Attribute Values: (indicated by “?”)

  7. Class Information

    • Die: 32

    • Live: 123

Data Curation

R Libraries

library(DT)
library(reactablefmtr)

Initial Data Table

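# hepatitis.data has no header row, so header = TRUE consumes the first record as column names;
# this is why the summaries below report 154 rows rather than 155 (header = FALSE would keep every record)
hepatitis<-read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data"),header = TRUE)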

reactable(hepatitis,rownames = TRUE)

Since the column names of the loaded table do not match the attribute names, we rename the columns with their respective attribute names. In the same way, the coded values of each attribute are replaced with their labels according to its type (e.g. Class: Die, Live, …).

colnames(hepatitis)<-c("Class","Age","Sex","Steroid","Antivirals","Fatigue","Malaise","Anorexia","Liver Big","Liver Firm","Spleen Palpable","Spiders","Ascites","Varices","Bilirubin","ALK Phosphate","SGOT","Albumin","Protime","Histology")

hepatitis["Class"][hepatitis["Class"]==1]<-"Die" 
hepatitis["Class"][hepatitis["Class"]==2]<-"Live" 
hepatitis["Sex"][hepatitis["Sex"]==1]<-"Male"
hepatitis["Sex"][hepatitis["Sex"]==2]<-"Female"
hepatitis[hepatitis==1]<-"No"
hepatitis[hepatitis==2]<-"Yes"
hepatitis[hepatitis=="?"]<-NA
reactable(hepatitis,rownames = TRUE)
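
The loading and recoding steps can also be sketched so that no record is lost and "?" is turned into NA at read time. This is a minimal sketch, not the code used to produce the outputs in this document. The two records in `sample_lines` are illustrative values in the layout of hepatitis.data (an assumption, so the example runs offline), and the names `sample_lines`, `sample_data`, `attribute_names` and `binary_columns` are hypothetical.

```r
# Two illustrative records in the layout of hepatitis.data (assumed values,
# used here so that the sketch runs without a network connection).
sample_lines <- c(
  "2,30,2,1,2,2,2,2,1,2,2,2,2,2,1.00,85,18,4.0,?,1",
  "1,51,1,1,2,1,2,1,2,2,1,1,2,2,4.60,?,242,3.3,50,2"
)

attribute_names <- c("Class", "Age", "Sex", "Steroid", "Antivirals", "Fatigue",
                     "Malaise", "Anorexia", "Liver Big", "Liver Firm",
                     "Spleen Palpable", "Spiders", "Ascites", "Varices",
                     "Bilirubin", "ALK Phosphate", "SGOT", "Albumin",
                     "Protime", "Histology")

# header = FALSE keeps the first record as data; na.strings turns "?" into NA,
# so the laboratory columns are read as numbers directly.
sample_data <- read.csv(text = sample_lines, header = FALSE, na.strings = "?",
                        col.names = attribute_names, check.names = FALSE)

# Recode the 1/2 codes column by column instead of over the whole table,
# so that numeric values equal to 1 or 2 are never overwritten.
binary_columns <- c("Steroid", "Antivirals", "Fatigue", "Malaise", "Anorexia",
                    "Liver Big", "Liver Firm", "Spleen Palpable", "Spiders",
                    "Ascites", "Varices", "Histology")
sample_data$Class <- factor(sample_data$Class, levels = 1:2,
                            labels = c("Die", "Live"))
sample_data$Sex <- factor(sample_data$Sex, levels = 1:2,
                          labels = c("Male", "Female"))
sample_data[binary_columns] <- lapply(sample_data[binary_columns], factor,
                                      levels = 1:2, labels = c("No", "Yes"))

str(sample_data)
```

With the real file, the same `read.csv()` arguments would be applied to the URL connection instead of `text = sample_lines`.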

Dataset Summary

summary(hepatitis)
##     Class                Age            Sex              Steroid         
##  Length:154         Min.   : 7.00   Length:154         Length:154        
##  Class :character   1st Qu.:32.00   Class :character   Class :character  
##  Mode  :character   Median :39.00   Mode  :character   Mode  :character  
##                     Mean   :41.27                                        
##                     3rd Qu.:50.00                                        
##                     Max.   :78.00                                        
##   Antivirals          Fatigue            Malaise            Anorexia        
##  Length:154         Length:154         Length:154         Length:154        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Liver Big          Liver Firm        Spleen Palpable      Spiders         
##  Length:154         Length:154         Length:154         Length:154        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Ascites            Varices           Bilirubin         ALK Phosphate     
##  Length:154         Length:154         Length:154         Length:154        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      SGOT             Albumin            Protime           Histology        
##  Length:154         Length:154         Length:154         Length:154        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

Note that the numeric columns were read as character type. We convert the categorical columns to factors and the laboratory measurements to numeric variables.

hepatitis$Class<-as.factor(unlist(hepatitis$Class))
hepatitis$Age<-as.factor(unlist(hepatitis$Age))
hepatitis$Sex<-as.factor(unlist(hepatitis$Sex))
hepatitis$Steroid<-as.factor(unlist(hepatitis$Steroid))
hepatitis$Antivirals<-as.factor(unlist(hepatitis$Antivirals))
hepatitis$Fatigue<-as.factor(unlist(hepatitis$Fatigue))
hepatitis$Malaise<-as.factor(unlist(hepatitis$Malaise))
hepatitis$Anorexia<-as.factor(unlist(hepatitis$Anorexia))
hepatitis$`Liver Big`<-as.factor(unlist(hepatitis$`Liver Big`))
hepatitis$`Liver Firm`<-as.factor(unlist(hepatitis$`Liver Firm`))
hepatitis$`Spleen Palpable`<-as.factor(unlist(hepatitis$`Spleen Palpable`))
hepatitis$Spiders<-as.factor(unlist(hepatitis$Spiders))
hepatitis$Ascites<-as.factor(unlist(hepatitis$Ascites))
hepatitis$Varices<-as.factor(unlist(hepatitis$Varices))
hepatitis$Bilirubin<-as.numeric(unlist(hepatitis$Bilirubin))
hepatitis$`ALK Phosphate`<-as.numeric(unlist(hepatitis$`ALK Phosphate`))
hepatitis$SGOT<-as.numeric(unlist(hepatitis$SGOT))
hepatitis$Albumin<-as.numeric(unlist(hepatitis$Albumin))
hepatitis$Protime<-as.numeric(unlist(hepatitis$Protime))
hepatitis$Histology<-as.factor(unlist(hepatitis$Histology))
summary(hepatitis)
##   Class          Age          Sex      Steroid   Antivirals Fatigue   
##  Die : 32   34     :  8   Female: 15   No  :75   No : 24    No  :100  
##  Live:122   38     :  8   Male  :139   Yes :78   Yes:130    Yes : 53  
##             30     :  7                NA's: 1              NA's:  1  
##             36     :  7                                               
##             39     :  6                                               
##             50     :  6                                               
##             (Other):112                                               
##  Malaise   Anorexia   Liver Big  Liver Firm Spleen Palpable Spiders  
##  No  :61   No  : 32   No  : 24   No  :60    No  : 30        No  :51  
##  Yes :92   Yes :121   Yes :120   Yes :83    Yes :119        Yes :98  
##  NA's: 1   NA's:  1   NA's: 10   NA's:11    NA's:  5        NA's: 5  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  Ascites    Varices      Bilirubin    ALK Phosphate        SGOT       
##  No  : 20   No  : 18   Min.   :0.30   Min.   : 26.0   Min.   : 14.00  
##  Yes :129   Yes :131   1st Qu.:0.70   1st Qu.: 74.0   1st Qu.: 32.25  
##  NA's:  5   NA's:  5   Median :1.00   Median : 85.0   Median : 58.00  
##                        Mean   :1.43   Mean   :105.5   Mean   : 86.35  
##                        3rd Qu.:1.50   3rd Qu.:133.0   3rd Qu.:100.75  
##                        Max.   :8.00   Max.   :295.0   Max.   :648.00  
##                        NA's   :6      NA's   :29      NA's   :4       
##     Albumin         Protime       Histology
##  Min.   :2.100   Min.   :  0.00   No :84   
##  1st Qu.:3.400   1st Qu.: 46.00   Yes:70   
##  Median :4.000   Median : 61.00            
##  Mean   :3.816   Mean   : 61.85            
##  3rd Qu.:4.200   3rd Qu.: 76.25            
##  Max.   :6.400   Max.   :100.00            
##  NA's   :16      NA's   :66

To deal with missing values in the numerical variables, we replace each missing value with the median of its column.

library(dplyr)
hepatitis[,15:19]<-hepatitis[,15:19] %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
summary(hepatitis)
##   Class          Age          Sex      Steroid   Antivirals Fatigue   
##  Die : 32   34     :  8   Female: 15   No  :75   No : 24    No  :100  
##  Live:122   38     :  8   Male  :139   Yes :78   Yes:130    Yes : 53  
##             30     :  7                NA's: 1              NA's:  1  
##             36     :  7                                               
##             39     :  6                                               
##             50     :  6                                               
##             (Other):112                                               
##  Malaise   Anorexia   Liver Big  Liver Firm Spleen Palpable Spiders  
##  No  :61   No  : 32   No  : 24   No  :60    No  : 30        No  :51  
##  Yes :92   Yes :121   Yes :120   Yes :83    Yes :119        Yes :98  
##  NA's: 1   NA's:  1   NA's: 10   NA's:11    NA's:  5        NA's: 5  
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  Ascites    Varices      Bilirubin     ALK Phosphate        SGOT       
##  No  : 20   No  : 18   Min.   :0.300   Min.   : 26.0   Min.   : 14.00  
##  Yes :129   Yes :131   1st Qu.:0.800   1st Qu.: 78.0   1st Qu.: 33.00  
##  NA's:  5   NA's:  5   Median :1.000   Median : 85.0   Median : 58.00  
##                        Mean   :1.414   Mean   :101.6   Mean   : 85.61  
##                        3rd Qu.:1.500   3rd Qu.:119.8   3rd Qu.: 99.50  
##                        Max.   :8.000   Max.   :295.0   Max.   :648.00  
##                                                                        
##     Albumin         Protime       Histology
##  Min.   :2.100   Min.   :  0.00   No :84   
##  1st Qu.:3.500   1st Qu.: 57.00   Yes:70   
##  Median :4.000   Median : 61.00            
##  Mean   :3.835   Mean   : 61.49            
##  3rd Qu.:4.200   3rd Qu.: 65.50            
##  Max.   :6.400   Max.   :100.00            
## 
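
The effect of median imputation can be seen on a single vector. The following is a minimal base R sketch. The helper `impute_median()` and the values in `protime` are hypothetical and only illustrate the rule used above.

```r
# Replace every NA in a numeric vector with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

protime <- c(40, NA, 60, 80, NA)   # made-up values for illustration
imputed <- impute_median(protime)
imputed

# The median is unchanged, but the spread shrinks because the filled-in
# values all sit at the centre of the distribution.
c(sd_observed = sd(protime, na.rm = TRUE), sd_imputed = sd(imputed))
```

The same shrinkage is visible in the summaries above: the median of Protime stays at 61, while its first and third quartiles move from 46.00 and 76.25 to 57.00 and 65.50 after imputation.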

FBI Arrest Data - Reported Number of Arrests by Crime

About the Dataset

This dataset contains the monthly number of reported arrests in the US for various offenses reported by participating law enforcement agencies. The arrests are by offense and broken down by age and sex or age and race. Not all agencies report race and/or ethnicity for arrests but they must report age and sex. Note that only agencies that have reported arrests for 12 months of the year are represented in the annual counts that are included in the database. Download this dataset to see totals of reported arrests for the nation from 1995–2016.

The dataset was taken from the Federal Bureau of Investigation (FBI) Crime Data Explorer.

Data Curation

Initial Data Table

# row.names = 2 uses the second column of the file as row names
FBI<-read.csv(url("https://s3-us-gov-west-1.amazonaws.com/cg-d3f0433b-a53e-4934-8b94-c678aa2cbaf3/arrests_national.csv"),row.names = 2,header = TRUE)

reactable(FBI,rownames = TRUE)

Since the id attribute is not used, it is removed from the dataset

FBI$id<-NULL
reactable(FBI,rownames = TRUE)

Dataset Summary

summary(FBI)
##    population        total_arrests         homicide          rape      
##  Min.   :262803276   Min.   :10662252   Min.   :10231   Min.   :16863  
##  1st Qu.:282395819   1st Qu.:12586911   1st Qu.:11348   1st Qu.:21701  
##  Median :297952772   Median :13839754   Median :13331   Median :25032  
##  Mean   :295487602   Mean   :13418226   Mean   :13710   Mean   :25205  
##  3rd Qu.:311023417   3rd Qu.:14180570   3rd Qu.:14134   3rd Qu.:28083  
##  Max.   :323127513   Max.   :15284300   Max.   :21230   Max.   :34650  
##     robbery       aggravated_assault    burglary         larceny       
##  Min.   : 94403   Min.   :358860     Min.   :207325   Min.   :1050058  
##  1st Qu.:105863   1st Qu.:400402     1st Qu.:288660   1st Qu.:1160498  
##  Median :108921   Median :442990     Median :295372   Median :1210490  
##  Mean   :116045   Mean   :445464     Mean   :294936   Mean   :1241126  
##  3rd Qu.:126438   3rd Qu.:478265     3rd Qu.:304564   3rd Qu.:1279616  
##  Max.   :171870   Max.   :568480     Max.   :386500   Max.   :1530200  
##  motor_vehicle_theft     arson       violent_crime    property_crime   
##  Min.   : 64566      Min.   : 8834   Min.   :480360   Min.   :1353283  
##  1st Qu.: 78934      1st Qu.:11519   1st Qu.:539047   1st Qu.:1606177  
##  Median :139978      Median :15834   Median :597236   Median :1630406  
##  Mean   :120921      Mean   :14733   Mean   :600415   Mean   :1671715  
##  3rd Qu.:148814      3rd Qu.:16759   3rd Qu.:626632   3rd Qu.:1677062  
##  Max.   :191900      Max.   :20000   Max.   :796250   Max.   :2128600  
##  other_assault        forgery           fraud         embezzlement  
##  Min.   :1078808   Min.   : 55333   Min.   :128531   Min.   :15200  
##  1st Qu.:1242966   1st Qu.: 72184   1st Qu.:173134   1st Qu.:16065  
##  Median :1293424   Median :107777   Median :281816   Median :17100  
##  Mean   :1259624   Mean   : 95762   Mean   :273572   Mean   :17620  
##  3rd Qu.:1310566   3rd Qu.:115451   3rd Qu.:343650   3rd Qu.:18852  
##  Max.   :1395800   Max.   :122300   Max.   :465000   Max.   :22381  
##  stolen_property    vandalism         weapons        prostitution   
##  Min.   : 88576   Min.   :191015   Min.   :137779   Min.   : 38306  
##  1st Qu.: 95519   1st Qu.:241417   1st Qu.:157338   1st Qu.: 58676  
##  Median :121936   Median :275064   Median :167153   Median : 78640  
##  Mean   :118191   Mean   :265275   Mean   :174857   Mean   : 74418  
##  3rd Qu.:128090   3rd Qu.:289934   3rd Qu.:190173   3rd Qu.: 87809  
##  Max.   :166500   Max.   :320900   Max.   :243900   Max.   :101600  
##  other_sex_offenses   drug_abuse         gambling     against_family  
##  Min.   : 51063     Min.   :1476100   Min.   : 3705   Min.   : 88748  
##  1st Qu.: 70076     1st Qu.:1533853   1st Qu.: 8900   1st Qu.:111938  
##  Median : 89082     Median :1576072   Median :10630   Median :127032  
##  Mean   : 81231     Mean   :1617127   Mean   :10736   Mean   :126231  
##  3rd Qu.: 93149     3rd Qu.:1674540   3rd Qu.:11916   3rd Qu.:143487  
##  Max.   :101900     Max.   :1889810   Max.   :21000   Max.   :155800  
##       dui           liquor_laws      drunkenness     disorderly_conduct
##  Min.   :1017808   Min.   :234899   Min.   :376433   Min.   :369733    
##  1st Qu.:1305198   1st Qu.:503684   1st Qu.:537818   1st Qu.:590412    
##  Median :1434117   Median :611335   Median :566726   Median :647346    
##  Mean   :1364988   Mean   :548852   Mean   :573149   Mean   :626903    
##  3rd Qu.:1461434   3rd Qu.:635714   3rd Qu.:632832   3rd Qu.:693571    
##  Max.   :1511300   Max.   :683124   Max.   :734800   Max.   :842600    
##     vagrancy         other           suspicion     curfew_loitering
##  Min.   :24851   Min.   :3218880   Min.   :  576   Min.   : 34176  
##  1st Qu.:27316   1st Qu.:3553687   1st Qu.: 1451   1st Qu.: 81406  
##  Median :29076   Median :3724251   Median : 3018   Median :139116  
##  Mean   :29909   Mean   :3668659   Mean   : 3909   Mean   :122666  
##  3rd Qu.:33056   3rd Qu.:3832337   3rd Qu.: 5562   3rd Qu.:152130  
##  Max.   :36471   Max.   :4022068   Max.   :12100   Max.   :187800

1. Aspects of Multivariate Analysis

Applications of Multivariate Methods

  • Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.

  • Sorting and grouping. Groups of “similar” objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.

  • Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?

  • Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.

  • Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

Descriptive Statistics

Sample Mean

The sample mean can be computed from the \(n\) measurements on each of the \(p\) variables, so that, in general, there will be \(p\) sample means:

\[ \overline{x}_{k} = \frac{1}{n}\sum_{j=1}^n x_{jk} \text{ where } k=1,2,3,...,p \]

Sample Variance

The sample variance is a measure of spread defined for the \(n\) measurements on each of the \(p\) variables. We have:

\[ s_k^2 = \frac{1}{n}\sum_{j=1}^n (x_{jk}-\overline{x}_k)^2 \text{ where } k=1,2,3,...,p \]
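
The two formulas above can be checked numerically. The following is a minimal sketch on a small made-up data matrix (an assumption for illustration). Note that the definition above divides by \(n\), whereas R's `var()` divides by \(n-1\), so the two differ by the factor \((n-1)/n\).

```r
# Toy data: n = 4 observations (rows) on p = 2 variables (columns).
x <- matrix(c(42, 52, 48, 58,
               4,  5,  4,  3), ncol = 2)
n <- nrow(x)

# p sample means, one per column
x_bar <- colMeans(x)

# Sample variances with divisor n, as in the formula above
s2_n <- colSums(sweep(x, 2, x_bar)^2) / n

# R's var() uses divisor n - 1
s2_unbiased <- apply(x, 2, var)

x_bar
s2_n
all.equal(s2_n, s2_unbiased * (n - 1) / n)
```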

Sample Covariance

The sample covariance is a measure of linear association between the measurements of variables \(i\) and \(k\). When \(i=k\), it reduces to the sample variance:

\[ s_{ik} = \frac{1}{n}\sum_{j=1}^n (x_{ji}-\overline{x}_i)(x_{jk}-\overline{x}_k) \text{ where }i=1,2,3,...,p \text{ and }k=1,2,3,...,p \]

Sample Correlation Coefficient

The correlation coefficient indicates the strength and direction of the linear relationship between two variables. It does not depend on the units of measurement and always lies between \(-1\) and \(1\). The sample correlation coefficient for the \(i\)th and \(k\)th variables is defined as

\[ r_{ik}=\frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} \text{ where }i=1,2,3,...,p \text{ and }k=1,2,3,...,p \]
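
As a minimal sketch on the same kind of made-up data (an assumption for illustration), the covariance and correlation matrices can be computed directly from their definitions. The divisor (\(n\) or \(n-1\)) cancels in \(r_{ik}\), so the result agrees with R's `cor()`.

```r
# Toy data: n = 4 observations (rows) on p = 2 variables (columns).
x <- matrix(c(42, 52, 48, 58,
               4,  5,  4,  3), ncol = 2)
n <- nrow(x)

# Centre each column, then S_n = (1/n) * X_c' X_c holds every s_ik
centered <- sweep(x, 2, colMeans(x))
S_n <- crossprod(centered) / n

# r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk))
R <- S_n / sqrt(outer(diag(S_n), diag(S_n)))

S_n
R
all.equal(R, cor(x))
```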

Graphical Techniques

For the graphical representation of the following techniques, see the application examples of:

  • Hepatitis Disease

  • FBI Arrest Data - Reported Number of Arrests by Crime

Scatter Plot

A scatter plot uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

Chi-Plot

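The chi-plot is a graphical tool for detecting dependence between two variables. It plots a measure of local dependence for each observation, is easy to interpret, and conveys more information than the usual correlation measures. Under independence, the points scatter around the horizontal line at zero.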
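
A chi-plot can be sketched in base R from its definition. The version below follows the construction usually attributed to Fisher and Switzer. The function name `chi_plot_values()`, the simulated data, and the approximate 95% control limits \(\pm 1.78/\sqrt{n}\) are assumptions made for illustration, not part of the analysis in this document.

```r
# For each point i, compare the joint empirical distribution H_i with the
# product of the marginals F_i * G_i (all computed without point i itself).
chi_plot_values <- function(x, y) {
  n <- length(x)
  idx <- seq_len(n)
  Fx  <- sapply(idx, function(i) sum(x[-i] <= x[i]) / (n - 1))
  Gy  <- sapply(idx, function(i) sum(y[-i] <= y[i]) / (n - 1))
  Hxy <- sapply(idx, function(i) sum(x[-i] <= x[i] & y[-i] <= y[i]) / (n - 1))

  chi <- (Hxy - Fx * Gy) / sqrt(Fx * (1 - Fx) * Gy * (1 - Gy))
  # lambda measures how far point i is from the centre of the data
  lambda <- 4 * sign((Fx - 0.5) * (Gy - 0.5)) *
    pmax((Fx - 0.5)^2, (Gy - 0.5)^2)

  keep <- is.finite(chi)   # drop points where the denominator is zero
  data.frame(lambda = lambda[keep], chi = chi[keep])
}

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)         # simulated, positively dependent data
cp <- chi_plot_values(x, y)

plot(cp$lambda, cp$chi, xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = expression(lambda), ylab = expression(chi), main = "Chi-plot")
abline(h = c(-1, 1) * 1.78 / sqrt(length(x)), lty = 2)  # approx. 95% band
```

Points lying mostly above the dashed band suggest positive dependence, while points inside it are compatible with independence.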

Box-Plot

A box-plot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
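
The five-number summary behind a box plot can be inspected with base R. The following is a minimal sketch on a made-up vector (an assumption for illustration). `boxplot.stats()` returns the whisker ends and hinges that `boxplot()` draws, together with the points it flags as outliers.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one extreme point

fivenum(x)            # minimum, lower hinge, median, upper hinge, maximum

bp <- boxplot.stats(x)
bp$stats              # whisker ends, hinges and median used in the plot
bp$out                # points beyond 1.5 times the hinge spread from the box

boxplot(x, horizontal = TRUE, main = "Box plot of x")
```

Here the upper whisker stops at 9 rather than at the maximum, because 30 lies more than 1.5 times the hinge spread above the upper hinge and is drawn as an individual point.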

Bubble Plot or Bubble Charts

Bubble plots are used when data needs a third dimension to provide richer information to viewers. A bubble plot is a relational chart designed to compare three variables.

Unlike other three-dimensional charts that process and represent data across three axes (usually x, y, and z), a bubble chart is represented on two axes (x and y), and the size of the bubble communicates the third, vital piece of information.

Stars Plot

Suppose each data unit consists of non-negative observations on \(p\geq2\) variables. In two dimensions, we can construct circles of a fixed (reference) radius with p equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.
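
A star plot can be drawn with the base R function `stars()`. The sketch below uses the first eight rows of the built-in `USArrests` data set, an assumption chosen only because it ships with R and contains non-negative measurements. It is not one of the datasets analysed in this document.

```r
# Eight observations on p = 4 non-negative variables
arrests_subset <- USArrests[1:8, ]

# Each star has one ray per variable; by default stars() rescales every
# column to [0, 1] so that variables on different scales are comparable.
stars(arrests_subset,
      key.loc = c(10, 2),   # position of the key that names the rays
      main = "Star plot of the first eight states in USArrests")
```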

Chernoff Faces

People react to faces. Chernoff suggested representing \(p\)-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the \(p\) variables.

As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved.

3D Scatter Plot

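A 3D scatter plot places three numeric variables on the x, y, and z axes, so that the relationship among the three can be examined at once. It is useful, for example, for investigating desirable response values and operating conditions.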

Conditioning Plot

A conditioning plot is a scatter plot of two variables conditioned on a third variable, called the conditioning variable. The conditioning variable can be either continuous or categorical. For a continuous variable, the subsets are created by dividing its range into smaller intervals. For a categorical variable, the subsets are created from its categories.
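
Both cases can be sketched with the base R function `coplot()`. The example uses the built-in `USArrests` data set and a hypothetical three-level grouping `UrbanLevel`, both assumptions made for illustration.

```r
# Continuous conditioning variable: UrbanPop is split into 3 overlapping
# intervals, and Murder vs Assault is plotted within each interval.
urban_intervals <- co.intervals(USArrests$UrbanPop, number = 3, overlap = 0.25)
urban_intervals
coplot(Murder ~ Assault | UrbanPop, data = USArrests,
       number = 3, overlap = 0.25)

# Categorical conditioning variable: one panel per category.
arrests <- USArrests
arrests$UrbanLevel <- cut(arrests$UrbanPop, breaks = c(0, 60, 75, 100),
                          labels = c("low", "medium", "high"))
coplot(Murder ~ Assault | UrbanLevel, data = arrests)
```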

Hepatitis Disease - Multivariate Data Description and Graphical Techniques

R Libraries

library(dplyr)
library(scatterplot3d)
library(corrplot)
library(ggplot2)
library(GGally)
library(Hmisc)  # provides rcorr(), used for the correlation test below

Mean

datatable(as.matrix(sapply(hepatitis[,15:19],function(x) mean(x, na.rm=TRUE))))

Variance

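# var() on the whole data frame returns the covariance matrix (shown in the next section);
# here only the variance of each variable is reported
datatable(as.matrix(sapply(hepatitis[,15:19],function(x) var(x, na.rm=TRUE))))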

Covariance

datatable(cov(hepatitis[,15:19],use = "complete.obs"))

Correlation

# Correlation matrix of the numeric variables
hepatitis.cor <- cor(hepatitis[,15:19], use = "complete.obs")
# Blank the diagonal (always 1) so that it is not highlighted as the column maximum
diag(hepatitis.cor) <- NA
y <- as.data.frame(hepatitis.cor)

reactable(y,
          defaultColDef = colDef(
              style = highlight_min_max(y)))
Correlation Plot
corrplot(cor(hepatitis[,15:19],use = "complete.obs"),method="number")

corrplot(cor(hepatitis[,15:19],use = "complete.obs"),method="circle")

Correlation Test: P-Values
hepatitis.rcorr = rcorr(as.matrix(hepatitis[,15:19]))
hepatitis.p=hepatitis.rcorr$P
reactable(as.data.frame.array(hepatitis.p),
          defaultColDef = colDef(
              style = highlight_min_max(as.data.frame.array(hepatitis.p))))
  • Since the correlation coefficients are close to 0, the linear relationships between the variables are weak.

  • Since the p-values obtained from the Pearson correlation test are less than 0.05, the null hypothesis of zero correlation is rejected for every pair: the correlations are weak, but statistically different from zero.
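
The p-values in such a table can be reproduced pair by pair with `cor.test()` from base R. The following is a minimal sketch. The helper `pairwise_cor_p()` and the data frame `toy` are hypothetical and serve only to show what each entry of the table means.

```r
# p-value of the Pearson test of H0: rho = 0 for every pair of columns
pairwise_cor_p <- function(data) {
  p <- ncol(data)
  p_values <- matrix(NA_real_, p, p,
                     dimnames = list(names(data), names(data)))
  for (i in seq_len(p - 1)) {
    for (k in (i + 1):p) {
      p_values[i, k] <- p_values[k, i] <-
        cor.test(data[[i]], data[[k]])$p.value
    }
  }
  p_values
}

toy <- data.frame(a = 1:10,
                  b = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
                  c = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6))  # made-up values

round(cor(toy), 3)
round(pairwise_cor_p(toy), 4)
```

A small p-value only says that the correlation differs from zero. The size of the coefficient itself is what indicates how strong the linear relationship is.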

Graphic Representation

Based on the correlation values obtained, we explore the relationship between the SGOT and Bilirubin variables graphically.

Scatter Plots
ggplot(hepatitis, aes(x = Bilirubin, 
                      y = SGOT, 
                      color=Class)) +
     geom_point() +
     labs(title = "Bilirubin vs SGOT by Class")

ggplot(hepatitis, aes(x = Bilirubin, 
                      y = SGOT, 
                      color=Anorexia,
                      shape=Class)) +
     geom_point() +
     labs(title = "Bilirubin vs SGOT by Class and Anorexia")

ggplot(hepatitis, aes(x = Bilirubin, 
                      y = SGOT, 
                      color=Anorexia,
                      shape=Antivirals)) +
     geom_point() +
     labs(title = "Bilirubin vs SGOT by Anorexia and Antivirals")

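# Age was stored as a factor above: as.numeric(as.character(Age)) recovers the actual ages,
# whereas as.numeric(Age) alone would return the factor level codes
ggplot(hepatitis, aes(x = Bilirubin, 
                      y = SGOT, 
                      color=Fatigue,
                      size=as.numeric(as.character(Age)))) +
     geom_point() +
     labs(title = "Bilirubin vs SGOT by Fatigue and Age", size = "Age")

ggplot(hepatitis, aes(x = Bilirubin, 
                      y = SGOT, 
                      color=Class,
                      size=as.numeric(as.character(Age)))) +
     geom_point() +
     labs(title = "Bilirubin vs SGOT by Class and Age", size = "Age")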

Pearson Correlations Above the Diagonal
ggscatmat(hepatitis, columns=16:19, color="Class")

FBI Arrest Data - Multivariate Data Description and Graphical Techniques

R Libraries

library(dplyr)
library(MVA)
library(aplpack)
library(scatterplot3d)
library(corrplot)
library(ggplot2)
library(Hmisc)  # provides rcorr(), used for the correlation test below

Mean

datatable(as.matrix(sapply(FBI,function(x) mean(x, na.rm=TRUE))))

Variance

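# var() on the whole data frame returns the covariance matrix (shown in the next section);
# here only the variance of each variable is reported
datatable(as.matrix(sapply(FBI,function(x) var(x, na.rm=TRUE))))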

Covariance

reactable(cov(FBI,use = "complete.obs"))

Correlation

# Correlation matrix of all the variables
FBI.cor <- cor(FBI, use = "complete.obs")
# Blank the diagonal (always 1) so that it is not highlighted as the column maximum
diag(FBI.cor) <- NA
y <- as.data.frame(FBI.cor)

reactable(y,
          defaultColDef = colDef(
              style = highlight_min_max(y)))
Correlation Plot
corrplot(cor(FBI,use = "complete.obs"),method="circle")

Correlation Test: P-Values
FBI.rcorr = rcorr(as.matrix(FBI))
FBI.p=FBI.rcorr$P

reactable(as.data.frame.array(FBI.p),
          defaultColDef = colDef(
              style = highlight_min_max(as.data.frame.array(FBI.p))))
  • As can be seen in the correlation plot, almost all the variables are correlated with each other.

  • The values highlighted in green in the correlation matrix mark the pairs of variables that are most positively correlated with each other.

  • The values highlighted in green in the correlation test table are the highest p-values, which correspond to the pairs of variables that are least correlated with each other. Where these p-values exceed the significance level, the null hypothesis of zero correlation is not rejected.

Graphic Representation

Considering the attributes in the correlation matrix that are most correlated positively with each other (values greater than 0.95):

  • aggravated_assault with violent_crime
  • violent_crime with homicide
  • violent_crime with stolen_property
  • aggravated_assault with fraud
  • fraud with arson
  • homicide with weapons
  • forgery with other_sex_offenses
  • prostitution with drunkenness
  • prostitution with curfew_loitering
  • drug_abuse with embezzlement
  • drug_abuse with vagrancy
  • dui with liquor_laws

Some of these attributes are used to produce the following graphs.

Scatter Plots
ggplot(FBI, aes(x=violent_crime, y=aggravated_assault)) + geom_point()+labs(title = "Violent Crime vs Aggravated Assault")

ggplot(FBI, aes(x=violent_crime, y=aggravated_assault,label=rownames(FBI))) + geom_text()+labs(title = "Violent Crime vs Aggravated Assault using Row Names")

ggplot(FBI, aes(x=violent_crime, y=homicide)) + geom_point()+labs(title = "Violent Crime vs Homicide")

ggplot(FBI, aes(x=violent_crime, y=homicide,label=rownames(FBI))) + geom_text()+labs(title = "Violent Crime vs Homicide using Row Names")

Bubble Charts
ggplot(FBI, aes(x=violent_crime, y=aggravated_assault,size=homicide)) + geom_point(alpha=0.5)+scale_size(range=c(.1,15))+labs(title = "Violent Crime vs Aggravated Assault and Homicide")

ggplot(FBI, aes(x=violent_crime, y=aggravated_assault,size=fraud)) + geom_point(alpha=0.5)+scale_size(range=c(.1,15))+labs(title = "Violent Crime vs Aggravated Assault and Fraud")

Chernoff Faces
df<-data.frame(FBI$aggravated_assault,FBI$violent_crime,FBI$homicide,FBI$stolen_property,FBI$fraud,FBI$arson,FBI$prostitution,FBI$other_sex_offenses,FBI$drunkenness,FBI$dui,FBI$liquor_laws,FBI$drug_abuse,FBI$curfew_loitering,FBI$embezzlement,FBI$vagrancy)

faces(df, main="United States FBI Arrest Data",face.type=0, print.info=TRUE,labels = rownames(FBI))

## effect of variables:
##  modified item       Var                     
##  "height of face   " "FBI.aggravated_assault"
##  "width of face    " "FBI.violent_crime"     
##  "structure of face" "FBI.homicide"          
##  "height of mouth  " "FBI.stolen_property"   
##  "width of mouth   " "FBI.fraud"             
##  "smiling          " "FBI.arson"             
##  "height of eyes   " "FBI.prostitution"      
##  "width of eyes    " "FBI.other_sex_offenses"
##  "height of hair   " "FBI.drunkenness"       
##  "width of hair   "  "FBI.dui"               
##  "style of hair   "  "FBI.liquor_laws"       
##  "height of nose  "  "FBI.drug_abuse"        
##  "width of nose   "  "FBI.curfew_loitering"  
##  "width of ear    "  "FBI.embezzlement"      
##  "height of ear   "  "FBI.vagrancy"

2. The Multivariate Normal Distribution

The Plausibility of \(\mu_0\) as a Value for a Normal Population Mean

Recall the univariate theory for determining whether a specific value \(\mu_0\) is a plausible value for the population mean \(\mu\).

  • From the point of view of hypothesis testing, the problem can be formulated as a test of:

    • \(H_0:\mu=\mu_0\)

    • \(H_1:\mu\not= \mu_0\)

Where \(H_0\) is the null hypothesis and \(H_1\) the alternative hypothesis

  • Let \(X_1,…,X_n\) be a random sample from a normal population. The appropriate test statistic, which under \(H_0\) has a Student’s \(t\) distribution with \(n-1\) degrees of freedom, is:

    \[t=\frac{(\overline{X}-\mu_0)}{\frac{s}{\sqrt{n}}}\] where:

\[ \overline{X}=\frac{1}{n}\sum_{j=1}^{n} X_j \text{ and } s^2=\frac{1}{n-1}\sum_{j=1}^n (X_j-\overline{X})^2 \]

  • Rejecting \(H_0\) when \(\mid t\mid\) is large is equivalent to rejecting \(H_0\) if \(t^2\) is large.

  • Here \(t^2\) is the squared distance from the sample mean \(\overline{X}\) to the test value \(\mu_0\):

    \[ t^2=\frac{(\overline{X}-\mu_0)^2}{\frac{s^2}{n}}=n(\overline{X}-\mu_0)^2(s^2)^{-1} \]

    Once \(\overline{X}\) and \(s^2\) are observed, the test becomes: reject \(H_0\) in favor of \(H_1\) at significance level \(\alpha\) if

\[ n(\overline{X}-\mu_0)^2(s^2)^{-1}>t^2_{n-1}(\alpha/2) \]

where \(t_{n-1}(\alpha/2)\) denotes the upper 100\((\alpha/2)\)th percentile of the \(t\)-distribution with \(n-1\) degrees of freedom.

  • If \(H_0\) is not rejected, we conclude that \(\mu_0\) is a plausible value for the normal population mean.
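The univariate decision rule can be sketched in R as follows. This is a minimal illustration on simulated data; the sample, `mu0` and `alpha` are assumptions chosen for the example and are not taken from the hepatitis data.

```r
# Simulated sample; the data, mu0 and alpha are illustrative assumptions
set.seed(1)
x     <- rnorm(25, mean = 10, sd = 2)
mu0   <- 10
alpha <- 0.05
n     <- length(x)

# Squared t statistic: n * (xbar - mu0)^2 * (s^2)^(-1)
t2 <- n * (mean(x) - mu0)^2 / var(x)

# Critical value: squared upper (alpha/2) percentile of t with n - 1 df
crit <- qt(1 - alpha / 2, df = n - 1)^2

# Reject H0 when t^2 exceeds the critical value
reject <- t2 > crit

# Cross-check against the built-in two-sided t test
tt <- t.test(x, mu = mu0)
all.equal(t2, unname(tt$statistic)^2)
```

Comparing \(t^2\) with the squared percentile gives the same decision as comparing the two-sided p-value of `t.test()` with \(\alpha\).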

The Statistic Hotelling’s \(T^2\)

A generalization of the squared distance from the sample mean \(\overline{X}\) to the test value \(\mu_0\) is:

\[ T^2=\left(\overline{X}-\mu_0\right)'\left(\frac{1}{n}S\right)^{-1}\left(\overline{X}-\mu_0\right)=n\left(\overline{X}-\mu_0\right)'S^{-1}\left(\overline{X}-\mu_0\right) \]

where:

\[ \overline{X}_{(p\times 1)} = \frac{1}{n}\sum_{j=1}^n X_j, S_{(p\times p)}=\frac{1}{n-1}\sum_{j=1}^n\left(X_j-\overline{X}\right)\left(X_j-\overline{X}\right)' \text{ and } \mu_0=\begin{bmatrix} \mu_{10} \\ \mu_{20} \\ \vdots \\ \mu_{p0} \end{bmatrix} \]

The hypothesis \(H_0:\mu=\mu_0\) is rejected if the observed statistical distance \(T^2\) is too large (i.e. if \(\overline{x}\) is too far from \(\mu_0\)).

Special tables of \(T^2\) percentage points are not required for formal tests of hypotheses, because:

\[T^2 \text{ is distributed as }\frac{(n-1)p}{n-p}F_{p,n-p}\] where \(F_{p,n-p}\) denotes a random variable with an \(F\)-distribution with \(p\) and \(n-p\) degrees of freedom.

Summary

Let \(\mathbf{X}_1,…,\mathbf{X}_n\) be a random sample from an \(N_p(\mu,\Sigma)\) population. Then with \(\mathbf{\overline{X}}=\frac{1}{n}\sum_{j=1}^n\mathbf{X}_j\) and \(S=\frac{1}{n-1}\sum_{j=1}^n(X_j-\overline{X})(X_j-\overline{X})'\),

\[\alpha=P\left[T^2>\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right]\]

\[ =P\left[n\left(\overline{X}-\mu\right)'S^{-1}\left(\overline{X}-\mu\right)>\frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\right] \]

whatever the true \(\mu\) and \(\Sigma\). Here \(F_{p,n-p}(\alpha)\) is the upper (100\(\alpha\))th percentile of the \(F_{p,n-p}\) distribution.
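A one-sample Hotelling \(T^2\) test can be computed directly from these formulas. The sketch below uses simulated data; `X`, `mu0` and `alpha` are assumptions for illustration only.

```r
# Simulated n x p data matrix (rows are observations); values are illustrative
set.seed(2)
n <- 40
p <- 3
X     <- matrix(rnorm(n * p), nrow = n, ncol = p)
mu0   <- rep(0, p)
alpha <- 0.05

xbar <- colMeans(X)   # sample mean vector
S    <- cov(X)        # sample covariance matrix (divisor n - 1)
d    <- xbar - mu0

# T^2 = n * (xbar - mu0)' S^{-1} (xbar - mu0)
T2 <- n * drop(t(d) %*% solve(S, d))

# Critical value: ((n - 1) p / (n - p)) * F_{p, n-p}(alpha)
crit <- (n - 1) * p / (n - p) * qf(1 - alpha, p, n - p)

# Equivalent F statistic and its p-value
Fstat <- (n - p) / ((n - 1) * p) * T2
pval  <- pf(Fstat, p, n - p, lower.tail = FALSE)

reject <- T2 > crit
```

Because \(T^2\) is a scaled \(F\) variable, rejecting when `T2 > crit` is the same decision as rejecting when `pval < alpha`.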

Hepatitis Disease - Hotelling’s \(T^2\) Test

R Libraries

library(Hotelling) 

Hotelling’s \(T^2\) Test with all Attributes by Class

t2testsparr <- hotelling.test(hepatitis$Bilirubin + hepatitis$`ALK Phosphate` + hepatitis$SGOT + 
                                  hepatitis$Albumin + hepatitis$Protime ~ hepatitis$Class)

t2testsparr
## Test stat:  71.979 
## Numerator df:  5 
## Denominator df:  148 
## P-value:  3.243e-11
  • The Hotelling \(T^2\) value is statistically significant, i.e. there is evidence of a difference between the mean vectors of patients who lived and patients who died, considering the 5 numeric attributes.

    • \(T^2=71.979\); the corresponding \(F\) statistic has 5 and 148 degrees of freedom.
    • \(p\)-value \(=3.243\times 10^{-11}\). The null hypothesis of equal mean vectors is rejected in favor of the alternative.

Hotelling’s \(T^2\) Test with all Attributes by Sex

t2testsparr <- hotelling.test(hepatitis$Bilirubin + hepatitis$`ALK Phosphate` + hepatitis$SGOT + 
                                  hepatitis$Albumin + hepatitis$Protime ~ hepatitis$Sex)

t2testsparr
## Test stat:  2.5208 
## Numerator df:  5 
## Denominator df:  148 
## P-value:  0.7827
  • The Hotelling \(T^2\) value is not statistically significant, i.e. there is no evidence of a difference between the mean vectors of male and female patients, considering the 5 numeric attributes.

    • \(T^2=2.5208\); the corresponding \(F\) statistic has 5 and 148 degrees of freedom.
    • \(p\)-value \(=0.7827\). The null hypothesis of equal mean vectors is not rejected.
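As a rough consistency check, the reported statistics can be converted to \(F\) values by hand. This sketch assumes that the printed "Test stat" is the two-sample \(T^2\) and that the usual scaling \(F = \frac{n_1+n_2-p-1}{(n_1+n_2-2)p}T^2\) applies; the total sample size is inferred from the denominator degrees of freedom and is not printed in the output.

```r
# Values copied from the two outputs above
p   <- 5     # number of response variables (numerator df)
df2 <- 148   # denominator df, assumed to equal n1 + n2 - p - 1

# Implied total sample size (an inference, not part of the printed output)
n_total <- df2 + p + 1

# Convert a two-sample T^2 to its F statistic and upper-tail p-value
t2_to_p <- function(T2) {
  Fstat <- T2 * df2 / ((n_total - 2) * p)
  pf(Fstat, p, df2, lower.tail = FALSE)
}

p_class <- t2_to_p(71.979)   # comparison by Class
p_sex   <- t2_to_p(2.5208)   # comparison by Sex
```

If the assumptions hold, `p_class` and `p_sex` should be close to the p-values printed by `hotelling.test()`.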

3. Wishart Distribution - Definition

The Wishart Distribution

The Wishart distribution is a family of distributions for symmetric positive definite matrices. Let \(\mathbf{X_1,X_2,…,X_n}\) be independent \(N_p(\mathbf{0},\Sigma)\) and form a \(p × n\) data matrix \(X = [X_1,...,X_n]\). The \(p × p\) random matrix \(\mathbf{M = XX}′=\sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i'\) is said to have the Wishart distribution.

Definition

The random matrix \(\mathbf{M}(p×p) = \sum_{i=1}^n \mathbf{X}_i\mathbf{X}_i'\) has the Wishart distribution with n degrees of freedom and covariance matrix \(\Sigma\) and is denoted by \(\mathbf{M} ∼ W_p(n, \Sigma)\). For n ≥ p, the probability density function of M is

\[ f(\mathbf{M}) = \frac{1}{2^{np/2}\Gamma_p \left(\frac{n}{2}\right) \left| \Sigma \right|^{n/2}} \left| \mathbf{M} \right|^{(n-p-1)/2}\exp \left( -\frac{1}{2}\text{trace}(\Sigma^{-1}\mathbf{M}) \right) \]

with respect to Lebesgue measure on the cone of symmetric positive definite matrices. Here, \(\Gamma_p(\alpha)\) is the multivariate gamma function.

The precise form of the density is rarely used. Two exceptions are:

  • In Bayesian computation, the Wishart distribution is often used as a conjugate prior for the inverse of a normal covariance matrix.

  • When symmetric positive definite matrices are themselves the random elements of interest, as in diffusion tensor studies.

The Wishart distribution is a multivariate extension of the \(\chi^2\) distribution. In particular, if \(\mathbf{M} ∼ W_1(n, \sigma^2)\), then \(\mathbf{M}/\sigma^2 ∼ \chi^2_n\). In the special case \(\Sigma = \mathbb{I}\), \(W_p(n, \mathbb{I})\) is called the standard Wishart distribution.
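The definition can be illustrated by building one Wishart draw from normal vectors and comparing many draws from `stats::rWishart()` with the known mean \(n\Sigma\). The values of `p`, `n` and `Sigma` below are arbitrary assumptions for the sketch.

```r
set.seed(3)
p <- 2
n <- 10
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = p)

# One draw from the definition: M = sum_i X_i X_i' with X_i ~ N_p(0, Sigma)
L <- chol(Sigma)                            # upper triangular, t(L) %*% L = Sigma
X <- t(matrix(rnorm(n * p), n, p) %*% L)    # p x n matrix, one X_i per column
M <- X %*% t(X)                             # p x p symmetric matrix

# Many draws with the built-in sampler; the result is a p x p x 5000 array
draws <- rWishart(5000, df = n, Sigma = Sigma)

# The element-wise average should be close to n * Sigma
M_bar <- apply(draws, c(1, 2), mean)
rel_err <- max(abs(M_bar / (n * Sigma) - 1))
```

With \(n \geq p\), `M` is positive definite with probability one, which is why the density above is defined on the cone of symmetric positive definite matrices.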

4. Principal Component Analysis

What is Principal Component Analysis (PCA)

PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them much faster without extraneous variables to process.

Explanation of Principal Component Analysis (Steps)

1. Standardization

The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA, is that the latter is quite sensitive regarding the variances of the initial variables.

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
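A minimal sketch of the standardization step, using the built-in `USArrests` data as a stand-in (an assumption for illustration; it is not the hepatitis data):

```r
# Built-in example data, used here only for illustration
X <- as.matrix(USArrests)

# Manual standardization: subtract the column mean, divide by the column sd
Z_manual <- sweep(X, 2, colMeans(X), "-")
Z_manual <- sweep(Z_manual, 2, apply(X, 2, sd), "/")

# Same result with the base R helper
Z <- scale(X)

# After standardization every column has mean 0 and standard deviation 1
col_means <- colMeans(Z)
col_sds   <- apply(Z, 2, sd)
```

A useful consequence is that the covariance matrix of the standardized data, `cov(Z)`, equals the correlation matrix of the original data, `cor(X)`.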

2. Covariance Matrix Computation

The covariance matrix is a \(p\times p\) symmetric matrix (where \(p\) is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables.

The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them.

3. Compute The Eigenvectors And Eigenvalues of the Covariance Matrix to Identify the Principal Components

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.

Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as your new variables.

An important thing to realize here is that the principal components are less interpretable and don’t have any real meaning since they are constructed as linear combinations of the initial variables.

Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most information of the data. The relationship between variance and information here, is that, the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more the information it has.
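The link between eigenvalues, eigenvectors and principal components can be sketched as follows, again on the built-in `USArrests` data (an illustrative assumption, not the data analyzed in this report):

```r
X <- as.matrix(USArrests)

# Eigen decomposition of the correlation matrix (covariance of standardized data)
R <- cor(X)
e <- eigen(R)          # eigenvalues are returned in decreasing order

# PCA on standardized variables with prcomp()
pca <- prcomp(X, scale. = TRUE)

# Eigenvalues are the component variances (squared standard deviations)
eig_vs_var <- all.equal(e$values, pca$sdev^2)

# Eigenvectors are the component directions, up to a sign flip per column
vec_vs_rot <- all.equal(abs(e$vectors), abs(unname(pca$rotation)))

# Share of total variance carried by each component
prop_var <- e$values / sum(e$values)

# The component scores are uncorrelated with each other
score_cor <- cor(pca$x)
```

The sign of each eigenvector is arbitrary, which is why the comparison uses absolute values.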

4. Feature Vector

In this step we choose whether to keep all the components or to discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors called the feature vector.

5. Final Step: Recast the Data Along the Principal Component Axes

The aim is to use the feature vector formed using the eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Components Analysis). This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set.
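The recasting step can be sketched as a matrix product. The example keeps `k = 2` components of the built-in `USArrests` data; both choices are assumptions made only for illustration.

```r
X <- as.matrix(USArrests)
Z <- scale(X)                      # standardized data, n x p

e <- eigen(cov(Z))                 # eigenvectors of the covariance matrix
k <- 2                             # number of components kept (illustrative)
feature_vector <- e$vectors[, 1:k] # p x k matrix of retained eigenvectors

# Transpose of the feature vector times transpose of the standardized data: k x n
final_t <- t(feature_vector) %*% t(Z)

# Equivalent n x k form, one row of scores per observation
scores <- Z %*% feature_vector

# prcomp() returns the same scores, up to a sign flip per component
ref <- prcomp(X, scale. = TRUE)$x[, 1:k]
```

The `k x n` and `n x k` forms hold the same numbers; the second is simply the transpose of the first and matches the layout returned by `prcomp()`.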

Hepatitis Disease - PCA

R Libraries

library(Hmisc)
library(psych)
library(bcv)

Correlation Matrix

cor(hepatitis[,15:19])
##                Bilirubin ALK Phosphate       SGOT    Albumin    Protime
## Bilirubin      1.0000000     0.1325704  0.2352884 -0.3701577 -0.2220303
## ALK Phosphate  0.1325704     1.0000000  0.1831043 -0.3358635 -0.1856398
## SGOT           0.2352884     0.1831043  1.0000000 -0.1065261 -0.1375747
## Albumin       -0.3701577    -0.3358635 -0.1065261  1.0000000  0.2964787
## Protime       -0.2220303    -0.1856398 -0.1375747  0.2964787  1.0000000

Principal Components Based on the Correlation Matrix

heppca<- prcomp(hepatitis[,15:19],scale=TRUE)
summary(heppca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5
## Standard deviation     1.3797 0.9633 0.9317 0.8820 0.7228
## Proportion of Variance 0.3807 0.1856 0.1736 0.1556 0.1045
## Cumulative Proportion  0.3807 0.5663 0.7399 0.8955 1.0000
Selection of Principal Components - Importance of Components
  • Standard Deviation: components with a standard deviation greater than 1 are kept, so 1 component is selected (PC1)
  • Proportion of Variance: no single component explains more than 70% of the total variation; the cumulative proportion first exceeds 70% at PC3 (73.99%)
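Both selection rules can be applied directly to the standard deviations printed by `summary(heppca)`. The sketch below copies those five values by hand; the 70% threshold is the one used in the text.

```r
# Standard deviations copied from the summary(heppca) output above
sdev <- c(1.3797, 0.9633, 0.9317, 0.8820, 0.7228)

# Component variances (eigenvalues) and their share of the total
eigenvalues <- sdev^2
prop        <- eigenvalues / sum(eigenvalues)
cum_prop    <- cumsum(prop)

# Rule 1: keep components whose standard deviation is greater than 1
kaiser_k <- sum(sdev > 1)

# Rule 2: smallest number of components whose cumulative share reaches 70%
cum_k <- which(cum_prop >= 0.70)[1]
```

The two rules need not agree: the first looks at each component on its own, the second at the accumulated share of variance.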

Components Variances Bar Plot

plot(heppca, xlab="Components")

Components Variances Scree Plot

plot(heppca, type="lines")

Components Selection

svd.gabriel
numcomp<-cv.svd.gabriel(cor(hepatitis[,15:19]), krow = 2, kcol = 2,  maxrank = 2)
numcomp
## 
## Call:
## cv.svd.gabriel(x = cor(hepatitis[, 15:19]), krow = 2, kcol = 2,     maxrank = 2)
## 
##   Rank    MSEP       SE
##  ----------------------
##      0  0.2412  0.07661 *+ 
##      1  0.3435  0.16317
##      2  4.0134  3.33413

From the above result, the lowest MSEP (marked *+) is at rank 0, with rank 1 next, so at most 1 component is retained.

svd.wold
numcomp1<-cv.svd.wold(cor(hepatitis[,15:19]), k=5, maxrank=5)
numcomp1
## 
## Call:
## cv.svd.wold(x = cor(hepatitis[, 15:19]), k = 5, maxrank = 5)
## 
##   Rank    MSEP       SE
##  ----------------------
##      0  0.2447  0.08581 *+ 
##      1  0.2628  0.04207
##      2  0.7866  0.19539
##      3  0.7994  0.26247
##      4  0.5368  0.10790
##      5  0.3441  0.09534

From the above result, the lowest MSEP (marked *+) is at rank 0, with rank 1 a close second, so at most 1 component is retained.

Component 1 and Component 2 Biplot of Variables and Individuals
biplot(heppca,cex=c(0.4,0.5),expand = 1)

  • For most patients, the values of the variables with the largest weights in each component are close to their respective averages.

Correlation vs Principal Components Bar Plot

hep.cpr <- heppca$rotation %*% diag(heppca$sdev)
barplot(t(hep.cpr), beside = TRUE, ylim = c(-1, 1))
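Multiplying each rotation column by the standard deviation of its component gives, for a PCA on standardized variables, the correlations between the original variables and the components. A minimal check of that identity on the built-in `USArrests` data (an assumption for illustration, not the hepatitis data):

```r
X   <- as.matrix(USArrests)
pca <- prcomp(X, scale. = TRUE)

# Loadings: eigenvector entries scaled by the component standard deviations
loadings <- pca$rotation %*% diag(pca$sdev)
colnames(loadings) <- colnames(pca$rotation)

# Direct correlations between original variables and component scores
direct <- cor(X, pca$x)

# Largest absolute difference between the two matrices
max_diff <- max(abs(loadings - direct))
```

This is why the bar plot can use a fixed vertical range from -1 to 1.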

Principal Component 1 vs Bilirubin Scatter Plot

par(mfrow=c(1,2))
plot(heppca$x[,1],hepatitis$Bilirubin,xlab="PC1")
plot(heppca$x[,2],hepatitis$Bilirubin,xlab="PC2")

Linear Regression Using the 2 Retained Components to Explain the Bilirubin Attribute

modlin<- lm(hepatitis$Bilirubin~heppca$x[,1]+heppca$x[,2])
summary(modlin)
## 
## Call:
## lm(formula = hepatitis$Bilirubin ~ heppca$x[, 1] + heppca$x[, 
##     2])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4345 -0.4252 -0.0635  0.2554  4.2544 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.41364    0.07066  20.007  < 2e-16 ***
## heppca$x[, 1]  0.56990    0.05138  11.092  < 2e-16 ***
## heppca$x[, 2] -0.23219    0.07359  -3.155  0.00194 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8768 on 151 degrees of freedom
## Multiple R-squared:  0.4683, Adjusted R-squared:  0.4612 
## F-statistic: 66.49 on 2 and 151 DF,  p-value: < 2.2e-16

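Since \(R^2=0.4683\), the two components explain less than half of the variability in Bilirubin, so the linear regression model is not considered adequate, even though both coefficients are statistically significant.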

FBI Arrest Data - PCA

For simplicity, only the attributes that are most strongly positively correlated with each other in the correlation matrix (values greater than 0.95) are considered:

  • aggravated_assault with violent_crime
  • violent_crime with homicide
  • violent_crime with stolen_property
  • aggravated_assault with fraud
  • fraud with arson
  • homicide with weapons
  • forgery with other_sex_offenses
  • prostitution with drunkenness
  • prostitution with curfew_loitering
  • drug_abuse with embezzlement
  • drug_abuse with vagrancy
  • dui with liquor_laws

R Libraries

library(Hmisc)
library(psych)
library(bcv)

Bartlett’s Test

Matrix of Correlations and P-Values
Correlation
df<-data.frame(FBI$aggravated_assault,FBI$violent_crime,FBI$homicide,FBI$stolen_property,FBI$fraud,FBI$arson,FBI$prostitution,FBI$other_sex_offenses,FBI$drunkenness,FBI$dui,FBI$liquor_laws,FBI$drug_abuse,FBI$curfew_loitering,FBI$embezzlement,FBI$vagrancy,row.names = rownames(FBI))

# Correlation matrix; entries equal to 1 (the diagonal) are set to NA so that
# highlight_min_max() only compares correlations between different variables
corr <- as.data.frame(cor(df))
corr[corr == 1] <- NA

reactable(corr,
          defaultColDef = colDef(style = highlight_min_max(corr)))
Correlation Test: P-Values
FBI.rcorr = rcorr(as.matrix(df))
FBI.p=FBI.rcorr$P

reactable(as.data.frame.array(FBI.p),
          defaultColDef = colDef(
              style = highlight_min_max(as.data.frame.array(FBI.p))))
Bartlett’s test
cortest.bartlett(df)
## R was not square, finding R from data
## $chisq
## [1] 689.2364
## 
## $p.value
## [1] 1.070471e-86
## 
## $df
## [1] 105

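Since the p-value is less than 0.05, the null hypothesis that the correlation matrix is an identity matrix (i.e. that the variables are uncorrelated) is rejected, so the data are suitable for PCA.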
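Bartlett's statistic can also be computed by hand from the correlation matrix as \(\chi^2 = -\left(n-1-\frac{2p+5}{6}\right)\ln|R|\) with \(p(p-1)/2\) degrees of freedom; for the 15 variables used here that gives the 105 degrees of freedom reported above. The sketch below uses the built-in `USArrests` data as a stand-in, which is an assumption for illustration.

```r
X <- as.matrix(USArrests)   # illustrative data, not the FBI arrest data
n <- nrow(X)
p <- ncol(X)
R <- cor(X)

# Bartlett's test of sphericity: H0 is that R is an identity matrix
chisq   <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df      <- p * (p - 1) / 2
p_value <- pchisq(chisq, df, lower.tail = FALSE)

# Degrees of freedom for a 15-variable problem, as in the output above
df_15 <- 15 * (15 - 1) / 2
```

The determinant of `R` is 1 when all variables are uncorrelated and approaches 0 as the correlations grow, so a large statistic is evidence against the identity matrix.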

Kaiser-Meyer-Olkin (KMO) Test

The Kaiser-Meyer-Olkin (KMO) Test is a test that is used to decide whether our samples are suitable for conducting factor analysis. Factor analysis in statistics is about identifying underlying factors or causes that can be used to represent the relationship between two or more variables.

KMO Test

KMO(df)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = df)
## Overall MSA =  0.73
## MSA for each item = 
## FBI.aggravated_assault      FBI.violent_crime           FBI.homicide 
##                   0.70                   0.65                   0.78 
##    FBI.stolen_property              FBI.fraud              FBI.arson 
##                   0.83                   0.74                   0.87 
##       FBI.prostitution FBI.other_sex_offenses        FBI.drunkenness 
##                   0.76                   0.75                   0.72 
##                FBI.dui        FBI.liquor_laws         FBI.drug_abuse 
##                   0.73                   0.75                   0.50 
##   FBI.curfew_loitering       FBI.embezzlement           FBI.vagrancy 
##                   0.72                   0.67                   0.46

Since the overall KMO value is 0.73, which is above the usual threshold of 0.5, the sampling is adequate and the data are well suited for PCA.
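The measure of sampling adequacy compares squared correlations with squared partial correlations. A base R sketch of that computation, using the built-in `USArrests` data as an illustrative stand-in (the names `kmo_overall` and `kmo_item` are introduced here and are not part of any package):

```r
X <- as.matrix(USArrests)   # illustrative data, not the FBI arrest data
R <- cor(X)

# Partial correlations from the inverse correlation matrix
Rinv <- solve(R)
Q    <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))

# Keep only the off-diagonal entries
R0 <- R
Q0 <- Q
diag(R0) <- 0
diag(Q0) <- 0

# Overall MSA: share of squared correlations in the total
kmo_overall <- sum(R0^2) / (sum(R0^2) + sum(Q0^2))

# MSA for each variable
kmo_item <- colSums(R0^2) / (colSums(R0^2) + colSums(Q0^2))
```

Values near 1 mean that the partial correlations are small relative to the correlations, which is the situation in which the variables share enough common variance for PCA or factor analysis.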

Principal Component Analysis (Using the Correlation Matrix)

Variance
reactable(var(df,use = "complete.obs"))
Principal Components
FBIPCA <- prcomp(df, scale = TRUE)  # scale = TRUE standardizes each variable first
Summary of Principal Components
summary(FBIPCA)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     3.2888 1.6792 0.71296 0.63426 0.44638 0.30214 0.2192
## Proportion of Variance 0.7211 0.1880 0.03389 0.02682 0.01328 0.00609 0.0032
## Cumulative Proportion  0.7211 0.9091 0.94296 0.96978 0.98306 0.98915 0.9923
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.19477 0.16915 0.15033 0.11660 0.08758 0.05124 0.03620
## Proportion of Variance 0.00253 0.00191 0.00151 0.00091 0.00051 0.00018 0.00009
## Cumulative Proportion  0.99488 0.99679 0.99829 0.99920 0.99971 0.99989 0.99997
##                           PC15
## Standard deviation     0.01957
## Proportion of Variance 0.00003
## Cumulative Proportion  1.00000
Selection of Principal Components - Importance of Components
  • Standard Deviation: components with a standard deviation greater than 1 are kept, so 2 components are selected (PC1, PC2)
  • Proportion of Variance: PC1 alone explains more than 70% of the total variation (72.11%)
Principal Components Variance Bar Plot
plot(FBIPCA)

Principal Components Variance Scree Plot
plot(FBIPCA,type="lines")

Biplots
Principal Component 1 and Principal Component 2
biplot(FBIPCA,cex=c(0.4,0.5),expand = 1)

  • 2003 is the most average year: its values for the variables with the largest weights in each component are close to their respective averages.

  • From 2013 onward, the variables take their lowest values.

  • It can be said that crimes have decreased over the years.

Components Selection

svd.gabriel
cv.svd.gabriel(x = cov(df), krow = 15, kcol = 15, maxrank = 14)
## 
## Call:
## cv.svd.gabriel(x = cov(df), krow = 15, kcol = 15, maxrank = 14)
## 
##   Rank       MSEP         SE
##  ---------------------------
##      0  2.080e+19  4.025e+18
##      1  1.790e+18  8.872e+17
##      2  9.589e+17  4.244e+17
##      3  6.019e+17  3.040e+17
##      4  9.110e+17  3.515e+17
##      5  5.098e+17  2.348e+17
##      6  1.555e+17  4.672e+16
##      7  1.779e+17  5.025e+16
##      8  3.242e+17  1.244e+17
##      9  3.956e+17  1.848e+17
##     10  6.420e+16  1.890e+16
##     11  2.020e+16  5.785e+15 *+ 
##     12  2.465e+16  6.877e+15
##     13  1.153e+17  3.992e+16
##     14  3.255e+17  1.631e+17

From the above result, the lowest MSEP (marked *+) is at rank 11, so 11 components are selected.

svd.wold
cv.svd.wold(cov(df), k=15, maxrank=14)
## 
## Call:
## cv.svd.wold(x = cov(df), k = 15, maxrank = 14)
## 
##   Rank       MSEP         SE
##  ---------------------------
##      0  2.080e+19  3.320e+18
##      1  1.805e+18  9.038e+17
##      2  8.569e+17  3.359e+17 *+ 
##      3  4.197e+18  1.372e+18
##      4  8.544e+18  1.738e+18
##      5  1.035e+19  1.543e+18
##      6  1.193e+19  1.989e+18
##      7  1.229e+19  1.982e+18
##      8  1.242e+19  1.994e+18
##      9  1.249e+19  1.993e+18
##     10  1.250e+19  1.992e+18
##     11  1.251e+19  1.992e+18
##     12  1.251e+19  1.992e+18
##     13  1.251e+19  1.992e+18
##     14  1.251e+19  1.992e+18

From the above result, the lowest MSEP (marked *+) is at rank 2, so 2 components are selected.

5. References