R

DPLYR를 이용한 Cars93처리 ①

한번해보즈아 2021. 4. 4. 19:23

MASS에 있는 Cars93 데이터를 dplyr package를 이용하여 처리해보도록 하겠습니다.

 

Cars93의 데이터 형태는 다음과 같습니다.

 

Cars93 자료

기본적으로 install.packages("패키지이름")을 이용하여 설치한다음 library를 이용하여 불러온다음 데이터의 기본적인 구조를 파악합니다.

> library(dplyr)
> library(MASS)

> df <- Cars93
> str(df) #자료의 구조파악
'data.frame':	93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
 
 > dim(df)  #행,열 수 파악
[1] 93 27

> distinct(df, Origin,Type)
      Type  Origin
1    Small non-USA
2  Midsize non-USA
3  Compact non-USA
4  Midsize     USA
5    Large     USA
6  Compact     USA
7   Sporty     USA
8      Van     USA
9    Small     USA
10  Sporty non-USA
11     Van non-USA

1.자료추출하기 (sample_n, sample_frac)이용

dplyr를 이용하여 자료를 추출할 수있는데 sample_n을 이용하면 원하는 개수 만큼 추출할수있고 sample_frac을 이용하면 원하는 비율만큼 추출할수 있습니다.

 

> sample_n(df,10)  #자료중 랜덤으로 10개 추출 //
   Manufacturer          Model    Type Min.Price Price Max.Price MPG.city MPG.highway            AirBags
1     Chevrolet       Cavalier Compact       8.5  13.4      18.3       25          36               None
2           Geo          Metro   Small       6.7   8.4      10.0       46          50               None
3         Buick     Roadmaster   Large      22.6  23.7      24.9       16          25        Driver only
4         Buick        Century Midsize      14.2  15.7      17.3       22          31        Driver only
5          Ford Crown_Victoria   Large      20.1  20.9      21.7       18          26        Driver only
6           Geo          Storm  Sporty      11.5  12.5      13.5       30          36        Driver only
7        Nissan         Maxima Midsize      21.0  21.5      22.0       21          26        Driver only
8     Chevrolet       Corvette  Sporty      34.6  38.0      41.5       17          25        Driver only
9      Chrylser       Concorde   Large      18.4  18.4      18.4       20          28 Driver & Passenger
10        Mazda            323   Small       7.4   8.3       9.1       29          37               None
   DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
1       Front         4        2.2        110 5200         2380             Yes               15.2          5
2       Front         3        1.0         55 5700         3755             Yes               10.6          4
3        Rear         6        5.7        180 4000         1320              No               23.0          6
4       Front         4        2.2        110 5200         2565              No               16.4          6
5        Rear         8        4.6        190 4200         1415              No               20.0          6
6       Front         4        1.6         90 5400         3250             Yes               12.4          4
7       Front         6        3.0        160 5200         2045              No               18.5          5
8        Rear         8        5.7        300 5000         1450             Yes               20.0          2
9       Front         6        3.3        153 5300         1990              No               18.0          6
10      Front         4        1.6         82 5000         2370             Yes               13.2          4
   Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin                Make
1     182       101    66          38           25.0           13   2490     USA  Chevrolet Cavalier
2     151        93    63          34           27.5           10   1695 non-USA           Geo Metro
3     216       116    78          45           30.5           21   4105     USA    Buick Roadmaster
4     189       105    69          41           28.0           16   2880     USA       Buick Century
5     212       114    78          43           30.0           21   3950     USA Ford Crown_Victoria
6     164        97    67          37           24.5           11   2475 non-USA           Geo Storm
7     188       104    69          41           28.5           14   3200 non-USA       Nissan Maxima
8     179        96    74          43             NA           NA   3380     USA  Chevrolet Corvette
9     203       113    74          40           31.0           15   3515     USA   Chrylser Concorde
10    164        97    66          34           27.0           16   2325 non-USA           Mazda 323


> sample_n(df[,1:3],10) #자료중 랜덤으로 10개추출하되 1~3열만 추출
    Manufacturer         Model    Type
1  Mercedes-Benz          190E Compact
2       Chrylser      Concorde   Large
3      Chevrolet         Astro     Van
4          Dodge        Shadow   Small
5        Mercury         Capri  Sporty
6     Oldsmobile       Achieva Compact
7     Oldsmobile Cutlass_Ciera Midsize
8        Hyundai         Excel   Small
9           Ford         Probe  Sporty
10        Suzuki         Swift   Small


> sample_n(df[,1:3],20,replace=TRUE) #bootstrap으로 복원추출하는 경우 //defalut값은 비복원(FALSE)
   Manufacturer        Model    Type
1     Chevrolet   Lumina_APV     Van
2      Plymouth        Laser  Sporty
3       Pontiac   Grand_Prix Midsize
4          Ford       Taurus Midsize
5      Chrylser     Concorde   Large
6     Chevrolet      Corsica Compact
7        Nissan       Altima Compact
8        Toyota       Previa     Van
9        Nissan        Quest     Van
10      Pontiac      Sunbird Compact
11      Pontiac      Sunbird Compact
12   Oldsmobile Eighty-Eight   Large
13        Acura       Legend Midsize
14         Ford       Escort   Small
15   Volkswagen      Eurovan     Van
16        Volvo          240 Compact
17          Geo        Metro   Small
18   Volkswagen       Passat Compact
19        Dodge       Spirit Compact
20        Acura       Legend Midsize


> distinct(sample_n(df[,1:3],20,replace=TRUE)) #분명 20개를 추출하였는데 중복때문에 20개이하로나옴
   Manufacturer      Model    Type
1          Audi         90 Compact
2          Audi        100 Midsize
3         Buick    Century Midsize
4    Volkswagen    Eurovan     Van
5          Ford   Aerostar     Van
6      Chrysler    LeBaron Compact
7      Plymouth      Laser  Sporty
8         Mazda    Protege   Small
9         Mazda        MPV     Van
10       Nissan     Altima Compact
11       Nissan     Maxima Midsize
12        Buick Roadmaster   Large
13         Ford      Probe  Sporty
14        Lexus      SC300 Midsize
15    Chevrolet Lumina_APV     Van
16        Dodge    Caravan     Van
17        Dodge    Stealth  Sporty
18        Dodge       Colt   Small


> sample_frac(df,0.1) # 랜덤으로 0.1 만큼 추출 -> 9개추출됨
  Manufacturer        Model    Type Min.Price Price Max.Price MPG.city MPG.highway            AirBags
1        Acura       Legend Midsize      29.2  33.9      38.7       18          25 Driver & Passenger
2    Chevrolet      Caprice   Large      18.0  18.8      19.6       17          26        Driver only
3          Geo        Metro   Small       6.7   8.4      10.0       46          50               None
4        Dodge       Shadow   Small       8.4  11.3      14.2       23          29        Driver only
5          BMW         535i Midsize      23.7  30.0      36.2       22          30        Driver only
6      Lincoln  Continental Midsize      33.3  34.3      35.3       17          26 Driver & Passenger
7   Oldsmobile Eighty-Eight   Large      19.5  20.7      21.9       19          28        Driver only
8    Chevrolet       Lumina Midsize      13.4  15.9      18.4       21          29               None
9        Mazda          MPV     Van      16.6  19.1      21.7       18          24               None
  DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
1      Front         6        3.2        200 5500         2335             Yes               18.0          5
2       Rear         8        5.0        170 4200         1350              No               23.0          6
3      Front         3        1.0         55 5700         3755             Yes               10.6          4
4      Front         4        2.2         93 4800         2595             Yes               14.0          5
5       Rear         4        3.5        208 5700         2545             Yes               21.1          4
6      Front         6        3.8        160 4400         1835              No               18.4          6
7      Front         6        3.8        170 4800         1570              No               18.0          6
8      Front         4        2.2        110 5200         2595              No               16.5          6
9        4WD         6        3.0        155 5000         2240              No               19.6          7
  Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin                    Make
1    195       115    71          38           30.0           15   3560 non-USA            Acura Legend
2    214       116    77          42           29.5           20   3910     USA       Chevrolet Caprice
3    151        93    63          34           27.5           10   1695 non-USA               Geo Metro
4    172        97    67          38           26.5           13   2670     USA            Dodge Shadow
5    186       109    69          39           27.0           13   3640 non-USA                BMW 535i
6    205       109    73          42           30.0           19   3695     USA     Lincoln Continental
7    201       111    74          42           31.5           17   3470     USA Oldsmobile Eighty-Eight
8    198       108    71          40           28.5           16   3195     USA        Chevrolet Lumina
9    190       110    72          39           27.5           NA   3735 non-USA               Mazda MPV


> df[,c("Origin","RPM","Model")] %>% group_by(Origin) %>% sample_n(5) #원하는 colum을 Origin기준 그룹으로 각 5개씩 랜덤 추출
# A tibble: 10 x 3
# Groups:   Origin [2]
   Origin    RPM Model     
   <fct>   <int> <fct>     
 1 USA      5000 Festiva   
 2 USA      5200 Sunbird   
 3 USA      5600 LeMans    
 4 USA      4000 Roadmaster
 5 USA      6000 Stealth   
 6 non-USA  5600 Altima    
 7 non-USA  5000 Previa    
 8 non-USA  5200 Loyale    
 9 non-USA  5000 MPV       
10 non-USA  5900 Civic    

2.새로운 변수 추가하기(새로운 변수생성 mutate, transmute) 

mutate는 기존의 열에 신규변수를추가하지만 transmute는 기존의 열을 날리고 신규변수만 남습니다.

> df_1 <- df[1:10,c("Model", "Min.Price", "Max.Price")] #순서대로 1행부터 10행, 3개 colum 추출


> df_1 %>%  mutate(df_1,
+                Price_range=Max.Price-Min.Price,
+                Price_mean=(Max.Price+Min.Price)/2) #가격의 범위와 평균구하기
        Model Min.Price Max.Price Price_range Price_mean
1     Integra      12.9      18.8         5.9      15.85
2      Legend      29.2      38.7         9.5      33.95
3          90      25.9      32.3         6.4      29.10
4         100      30.8      44.6        13.8      37.70
5        535i      23.7      36.2        12.5      29.95
6     Century      14.2      17.3         3.1      15.75
7     LeSabre      19.9      21.7         1.8      20.80
8  Roadmaster      22.6      24.9         2.3      23.75
9     Riviera      26.3      26.3         0.0      26.30
10    DeVille      33.0      36.3         3.3      34.65

> df_2 <- transmute(df_1,
+                Price_range=Max.Price-Min.Price,
+                Price_mean=(Max.Price+Min.Price)/2) 


> df_2
   Price_range Price_mean
1          5.9      15.85
2          9.5      33.95
3          6.4      29.10
4         13.8      37.70
5         12.5      29.95
6          3.1      15.75
7          1.8      20.80
8          2.3      23.75
9          0.0      26.30
10         3.3      34.65

3. 통계값 계산 (summarise)

 

> ##통계값 계산 (summarise)
> summary(df$Price,na.rm=TRUE) #방법 1 dplyr 사용x
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.40   12.20   17.70   19.51   23.30   61.90 
   
   
> summarise(df,
+           Price_mean=mean(Price,na.rm=TRUE),
+           Price_median=median(Price,na.rm=TRUE),
+           Price_sd=sd(Price,na.rm=TRUE),
+           Price_min=min(Price,na.rm=TRUE),
+           Price_max=max(Price,na.rm=TRUE),
+           Price_mean=mean(Price,na.rm=TRUE),
+           Price_IQR=IQR(Price,na.rm=TRUE),
+           Price_sum=sum(Price,na.rm=TRUE),
+           tot_cnt=n(),
+           Manufacturer_dist_cnt=n_distinct(Manufacturer),
+           First_obs=first(Manufacturer),
+           Last_obs=last(Manufacturer),
+           Nth_3th_obs=nth(Manufacturer,5))  #방법 2 dplyr 사용o
  Price_mean Price_median Price_sd Price_min Price_max Price_IQR Price_sum tot_cnt Manufacturer_dist_cnt
1   19.50968         17.7  9.65943       7.4      61.9      11.1    1814.4      93                    32
  First_obs Last_obs Nth_3th_obs
1     Acura    Volvo         BMW


> #그룹별 통계량 계산
> df %>% group_by(Type) %>% 
+   summarise(tot_cnt=n(),
+             Manufacturer_dist_cnt=n_distinct(Manufacturer),
+             Price_mean=mean(Price,na.rm=TRUE),
+             Price_sd=sd(Price,na.rm=TRUE))
# A tibble: 6 x 5
  Type    tot_cnt Manufacturer_dist_cnt Price_mean Price_sd
* <fct>     <int>                 <int>      <dbl>    <dbl>
1 Compact      16                    15       18.2     6.69
2 Large        11                    10       24.3     6.34
3 Midsize      22                    20       27.2    12.3 
4 Small        21                    16       10.2     1.95
5 Sporty       14                    12       19.4     7.97
6 Van           9                     8       19.1     1.88

여기서 tot_cnt와 Manufacturer_dist_cnt의 개수가 다른이유는 filter를 이용하여 알수있는데 자세한filter사용방법은 다음시간에 다루기로 합니다.

Type이 Compact인 데이터 추출결과 총 16개가 나오는데 2,3행을 보면 Manufacturer과 Type가 겹쳐 15개로 계산된걸 알수있습니다.

> df %>% group_by(Type,Manufacturer) %>% filter(Type=="Compact")
# A tibble: 16 x 27
# Groups:   Type, Manufacturer [15]
   Manufacturer Model Type  Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain Cylinders
   <fct>        <fct> <fct>     <dbl> <dbl>     <dbl>    <int>       <int> <fct>   <fct>      <fct>    
 1 Audi         90    Comp~      25.9  29.1      32.3       20          26 Driver~ Front      6        
 2 Chevrolet    Cava~ Comp~       8.5  13.4      18.3       25          36 None    Front      4        
 3 Chevrolet    Cors~ Comp~      11.4  11.4      11.4       25          34 Driver~ Front      4        
 4 Chrysler     LeBa~ Comp~      14.5  15.8      17.1       23          28 Driver~ Front      4        
 5 Dodge        Spir~ Comp~      11.9  13.3      14.7       22          27 Driver~ Front      4        
 6 Ford         Tempo Comp~      10.4  11.3      12.2       22          27 None    Front      4        
 7 Honda        Acco~ Comp~      13.8  17.5      21.2       24          31 Driver~ Front      4        
 8 Mazda        626   Comp~      14.3  16.5      18.7       26          34 Driver~ Front      4        
 9 Mercedes-Be~ 190E  Comp~      29    31.9      34.9       20          29 Driver~ Rear       4        
10 Nissan       Alti~ Comp~      13    15.7      18.3       24          30 Driver~ Front      4        
11 Oldsmobile   Achi~ Comp~      13    13.5      14         24          31 None    Front      4        
12 Pontiac      Sunb~ Comp~       9.4  11.1      12.8       23          31 None    Front      4        
13 Saab         900   Comp~      20.3  28.7      37.1       20          26 Driver~ Front      4        
14 Subaru       Lega~ Comp~      16.3  19.5      22.7       23          30 Driver~ 4WD        4        
15 Volkswagen   Pass~ Comp~      17.6  20        22.4       21          30 None    Front      4        
16 Volvo        240   Comp~      21.8  22.7      23.5       21          28 Driver~ Rear       4   

tip

 

ctrl+shift+c 누르면 주석처리

 

# 1번  df_1 %>%  mutate(df_1,
#                  Price_range=Max.Price-Min.Price,
#                  Price_mean=(Max.Price+Min.Price)/2) 
# 
# 
# 2번  df_1 <- mutate(df_1,
#                Price_range=Max.Price-Min.Price,
#                Price_mean=(Max.Price+Min.Price)/2) 

1번 방식으로 하면 df_1에 자료가 할당되지는 않지만 바로 결과값을 볼수 있고 2번방식은 할당이 바로되며 결과값도 바로볼수있다.