共用方式為


教學課程:使用 R 預測鱷梨價格

本教學課程提供 Microsoft Fabric 中 Synapse 資料科學 工作流程的端對端範例。 它會使用 R 來分析和可視化 美國 中的鱷梨價格,以建立可預測未來鱷梨價格的機器學習模型。

本教學課程涵蓋下列步驟:

  • 載入預設連結庫
  • 載入資料
  • 自訂數據
  • 將新套件新增至會話
  • 分析和可視化數據
  • 定型模型

Screenshot of avocados.

必要條件

  • 開啟或建立筆記本。 若要瞭解如何,請參閱 如何使用 Microsoft Fabric 筆記本

  • 將語言選項設定為 SparkR (R) 以變更主要語言。

  • 將筆記本附加至 Lakehouse。 在左側,選取 [新增 ] 以新增現有的 Lakehouse 或建立 Lakehouse。

載入連結庫

使用預設 R 執行時間的連結庫:

library(tidyverse)
library(lubridate)
library(hms)

載入資料

從讀取鱷梨價格。從因特網下載的 CSV 檔案:

df <- read.csv('https://synapseaisolutionsa.blob.core.windows.net/public/AvocadoPrice/avocado.csv', header = TRUE)
head(df,5)

操作資料

首先,為數據行提供更方便的名稱。

# To use lowercase
names(df) <- tolower(names(df))

# To use snake case
avocado <- df %>% 
  rename("av_index" = "x",
         "average_price" = "averageprice",
         "total_volume" = "total.volume",
         "total_bags" = "total.bags",
         "amount_from_small_bags" = "small.bags",
         "amount_from_large_bags" = "large.bags",
         "amount_from_xlarge_bags" = "xlarge.bags")

# Rename codes
avocado2 <- avocado %>% 
  rename("PLU4046" = "x4046",
         "PLU4225" = "x4225",
         "PLU4770" = "x4770")

head(avocado2,5)

變更資料類型、移除不必要的資料行,並新增總耗用量:

# Convert data
avocado2$year = as.factor(avocado2$year)
avocado2$date = as.Date(avocado2$date)
avocado2$month  = factor(months(avocado2$date), levels = month.name)
avocado2$average_price =as.numeric(avocado2$average_price)
avocado2$PLU4046 = as.double(avocado2$PLU4046)
avocado2$PLU4225 = as.double(avocado2$PLU4225)
avocado2$PLU4770 = as.double(avocado2$PLU4770)
avocado2$amount_from_small_bags = as.numeric(avocado2$amount_from_small_bags)
avocado2$amount_from_large_bags = as.numeric(avocado2$amount_from_large_bags)
avocado2$amount_from_xlarge_bags = as.numeric(avocado2$amount_from_xlarge_bags)


# Remove unwanted columns
avocado2 <- avocado2 %>% 
  select(-av_index,-total_volume, -total_bags)

# Calculate total consumption 
avocado2 <- avocado2 %>% 
  mutate(total_consumption = PLU4046 + PLU4225 + PLU4770 + amount_from_small_bags + amount_from_large_bags + amount_from_xlarge_bags)

安裝新套件

使用內嵌套件安裝,將新的套件新增至工作階段:

install.packages(c("repr","gridExtra","fpp2"))

載入所需的連結庫。

library(tidyverse) 
library(knitr)
library(repr)
library(gridExtra)
library(data.table)

分析和可視化數據

依區域比較傳統(非組織)鱷梨價格:

options(repr.plot.width = 10, repr.plot.height =10)
# filter(mydata, gear %in% c(4,5))
avocado2 %>% 
  filter(region %in% c("PhoenixTucson","Houston","WestTexNewMexico","DallasFtWorth","LosAngeles","Denver","Roanoke","Seattle","Spokane","NewYork")) %>%  
  filter(type == "conventional") %>%           
  select(date, region, average_price) %>% 
  ggplot(aes(x = reorder(region, -average_price, na.rm = T), y = average_price)) +
  geom_jitter(aes(colour = region, alpha = 0.5)) +
  geom_violin(outlier.shape = NA, alpha = 0.5, size = 1) +
  geom_hline(yintercept = 1.5, linetype = 2) +
  geom_hline(yintercept = 1, linetype = 2) +
  annotate("rect", xmin = "LosAngeles", xmax = "PhoenixTucson", ymin = -Inf, ymax = Inf, alpha = 0.2) +
  geom_text(x = "WestTexNewMexico", y = 2.5, label = "My top 5 cities!", hjust = 0.5) +
  stat_summary(fun = "mean") +
  labs(x = "US city",
       y = "Avocado prices", 
       title = "Figure 1. Violin plot of nonorganic avocado prices",
       subtitle = "Visual aids: \n(1) Black dots are average prices of individual avocados by city \n     between January 2015 and March 2018. \n(2) The plot is ordered descendingly.\n(3) The body of the violin becomes fatter when data points increase.") +
  theme_classic() + 
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 25, vjust = 0.65),
        plot.title = element_text(face = "bold", size = 15)) +
  scale_y_continuous(lim = c(0, 3), breaks = seq(0, 3, 0.5))

Screenshot that shows a graph of nonorganic prices.

專注於休斯頓地區。

library(fpp2)
conv_houston <- avocado2 %>% 
  filter(region == "Houston",
         type == "conventional") %>% 
  group_by(date) %>% 
  summarise(average_price = mean(average_price))
  
# Set up ts   

conv_houston_ts <- ts(conv_houston$average_price,
                 start = c(2015, 1),
                 frequency = 52) 
# Plot

autoplot(conv_houston_ts) +
  labs(title = "Time plot: nonorganic avocado weekly prices in Houston",
       y = "$") +
  geom_point(colour = "brown", shape = 21) +
  geom_path(colour = "brown")

Screenshot of a graph of avocado prices in Houston.

將機器學習模型定型

根據自動回歸綜合移動平均(ARIMA):為休斯頓區域建立價格預測模型:

conv_houston_ts_arima <- auto.arima(conv_houston_ts,
                                    d = 1,
                                    approximation = F,
                                    stepwise = F,
                                    trace = T)
checkresiduals(conv_houston_ts_arima)

Screenshot that shows a graph of residuals.

顯示來自休斯頓 ARIMA 模型的預測圖表:

conv_houston_ts_arima_fc <- forecast(conv_houston_ts_arima, h = 208)

autoplot(conv_houston_ts_arima_fc) + labs(subtitle = "Prediction of weekly prices of nonorganic avocados in Houston",
       y = "$") +
  geom_hline(yintercept = 2.5, linetype = 2, colour = "blue")

Screenshot that shows a graph of forecasts from the ARIMA model.