てきどにがんばる統計解析: Visualizing associations between variables as a graph

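Using the iris dataset, this post measures the strength of association between categorical variables with the φ coefficient and visualizes the result as a graph.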
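
As a reminder of what the φ coefficient measures: for two binary variables it is computed from the counts of the 2x2 contingency table, and it coincides with the Pearson correlation of the 0/1 indicators. Below is a minimal base R sketch of that equivalence. `phi_coefficient` is a hypothetical helper name, and the `Sepal.Length < 5.5` cut-off is arbitrary, chosen only for illustration.

```r
# Hypothetical helper: phi coefficient of two logical vectors,
# computed from the four cell counts of the 2x2 contingency table.
phi_coefficient <- function(x, y) {
  n11 <- sum(x & y)
  n10 <- sum(x & !y)
  n01 <- sum(!x & y)
  n00 <- sum(!x & !y)
  # as.numeric() keeps the products in double precision
  num <- as.numeric(n11) * n00 - as.numeric(n10) * n01
  den <- sqrt(as.numeric(n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
  num / den
}

# Two binary variables derived from iris (the threshold is an arbitrary assumption)
x <- iris$Species == "setosa"
y <- iris$Sepal.Length < 5.5

phi_manual  <- phi_coefficient(x, y)
phi_pearson <- cor(as.numeric(x), as.numeric(y))

# The two values should agree up to floating-point error
all.equal(phi_manual, phi_pearson)
```

This equivalence is why a correlation routine applied to 0/1 indicators can be read as a φ coefficient.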

Main packages used

tidyverse (dplyr, tidyr, stringr, ggplot2), widyr, igraph, ggraph

Creating the data

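Each continuous variable is converted to a categorical one by splitting it into tertiles:

low: 0th to 33.3rd percentile
mid: 33.3rd to 66.7th percentile
high: 66.7th to 100th percentile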
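
The binning rule above can be sketched on a toy vector. This is only an illustration: `to_tertile` is a hypothetical helper name that does not appear in the post, and it uses the same `cut()` plus `quantile()` combination as the code further down.

```r
# Hypothetical helper: bin a numeric vector into tertiles labelled low/mid/high.
to_tertile <- function(x) {
  cut(
    x,
    # break points at the 0%, 33.3%, 66.7% and 100% quantiles
    breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
    labels = c("low", "mid", "high"),
    # without this, the minimum value would fall outside the first bin (NA)
    include.lowest = TRUE
  )
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
tertiles <- to_tertile(x)
table(tertiles)
```

One caveat with this approach: if two of the quantiles coincide (many tied values), `cut()` stops with an error because the breaks are not unique, so the helper is only safe for variables with enough distinct values.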

Note that an id column is also added. It identifies the group (here, a single record) within which co-occurrence is defined.

library(tidyverse)
library(widyr)

df.iris.cat <- tibble::as_tibble(iris) %>%

  dplyr::mutate(
    # Assign a unique number that identifies each record
    id = dplyr::row_number(),

    # Convert each continuous variable into tertile categories
    sepal_length = cut(
      Sepal.Length,
      breaks = quantile(Sepal.Length, probs = seq(0, 1, length.out = 4)),
      labels = c("low", "mid", "high"),
      include.lowest = TRUE
    ),
    sepal_width = cut(
      Sepal.Width,
      breaks = quantile(Sepal.Width, probs = seq(0, 1, length.out = 4)),
      labels = c("low", "mid", "high"),
      include.lowest = TRUE
    ),
    petal_length = cut(
      Petal.Length,
      breaks = quantile(Petal.Length, probs = seq(0, 1, length.out = 4)),
      labels = c("low", "mid", "high"),
      include.lowest = TRUE
    ),
    petal_width = cut(
      Petal.Width,
      breaks = quantile(Petal.Width, probs = seq(0, 1, length.out = 4)),
      labels = c("low", "mid", "high"),
      include.lowest = TRUE
    )
  ) %>%

  dplyr::select(
    id,
    species = Species,
    sepal_length,
    sepal_width,
    petal_length,
    petal_width
  )

df.iris.cat
id	species	sepal_length	sepal_width	petal_length	petal_width
1	setosa	low	high	low	low
2	setosa	low	mid	low	low
3	setosa	low	mid	low	low
4	setosa	low	mid	low	low
5	setosa	low	high	low	low

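Transforming the data

Next, the data is reshaped for visualization. pivot_longer, along with pivot_wider (not used here), is very handy.
Note: they supersede the older tidyr::gather and tidyr::spread, respectively.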

df.iris.cor <- df.iris.cat %>%

  # Convert to long format
  tidyr::pivot_longer(species:petal_width, names_to = "feature", values_to = "category") %>%

  # Join the variable name and the category value (low, mid, high)
  tidyr::unite(col = "category", feature, category, sep = "_") %>%

  # Compute the φ coefficient between every pair of categories
  # upper = FALSE drops one of each symmetric (item1, item2) pair (think of a lower-triangular matrix)
  # e.g. (setosa, sepal_length_low) and (sepal_length_low, setosa)
  widyr::pairwise_cor(category, id, upper = FALSE) %>%

  # Drop pairs whose two items come from the same variable
  # e.g. two Species levels get a φ coefficient of -0.5 purely by definition
  dplyr::filter(
    # Keep only pairs where item1 and item2 have different prefixes
    stringr::str_extract(item1, pattern = "^(species|(sepal|petal)_(length|width))")
      != stringr::str_extract(item2, pattern = "^(species|(sepal|petal)_(length|width))")
  ) %>%

  # Strip the "species_" prefix so that species nodes are labelled by name only
  dplyr::mutate(
    item1 = stringr::str_remove(item1, "^species_"),
    item2 = stringr::str_remove(item2, "^species_")
  )

df.iris.cor
item1	item2	correlation
setosa	sepal_length_low	0.8221449
setosa	sepal_width_high	0.5837769
sepal_length_low	sepal_width_high	0.4056025
setosa	petal_length_low	1.0000000
sepal_length_low	petal_length_low	0.8221449
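
To see where numbers like these come from, the same quantities can be reproduced by hand with base R. This sketch assumes that pairwise_cor effectively computes the Pearson correlation between 0/1 indicator columns (one column per category, one row per id); the `indicators` matrix and the variable names are introduced here only for illustration.

```r
# Same tertile binning as in the post, for Petal.Length only
petal_length_cat <- cut(
  iris$Petal.Length,
  breaks = quantile(iris$Petal.Length, probs = seq(0, 1, length.out = 4)),
  labels = c("low", "mid", "high"),
  include.lowest = TRUE
)

# 0/1 indicator columns: one row per record (id), one column per category
indicators <- cbind(
  setosa           = as.numeric(iris$Species == "setosa"),
  versicolor       = as.numeric(iris$Species == "versicolor"),
  petal_length_low = as.numeric(petal_length_cat == "low")
)

# Pearson correlation of indicators, i.e. the phi coefficient of each pair
phi <- cor(indicators)

# Every setosa record falls in the lowest Petal.Length tertile and vice versa
phi["setosa", "petal_length_low"]

# Two mutually exclusive, equally sized levels of the same variable
phi["setosa", "versicolor"]
```

The second value illustrates why pairs from the same variable are filtered out: with three equally sized, mutually exclusive species, any two indicators have a φ of -0.5 regardless of the data, so those edges carry no information.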

Visualization

Let's visualize the result with ggraph.

df.iris.cor %>%

  # Drop small coefficients
  # The threshold of 0.4 was chosen loosely after trying a few values
  dplyr::filter(abs(correlation) > 0.4) %>%

  # Convert the tibble to a graph object and plot it
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +

    # Edge settings
    ggraph::geom_edge_link(
      aes(
        # Stronger correlations are drawn darker and thicker
        edge_alpha = abs(correlation),
        edge_width = abs(correlation),

        # Colour edges by the sign of the correlation
        color = factor(correlation > 0)
      ),
      show.legend = FALSE
    ) +
    ggraph::scale_edge_width(range = c(0.35, 1)) +

    # Node settings
    ggraph::geom_node_point(
      aes(
        # Use a different colour and size for nodes that come from Species
        colour = as.factor(stringr::str_detect(name, pattern = "^(setosa|versicolor|virginica)")),
        size = ifelse(stringr::str_detect(name, pattern = "^(setosa|versicolor|virginica)"), 7, 1)
      ),
      show.legend = FALSE
    ) +
    ggplot2::scale_size_area() +

    # Label each node with its category value
    ggraph::geom_node_text(aes(label = name), vjust = -1, hjust = 0.5, check_overlap = TRUE) +

    ggraph::theme_graph()

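Edges are coloured by the sign of the correlation, and their opacity and width reflect its strength. Focusing on the blue edges, which are positive correlations, we can see that a cluster forms around each of Virginica, Setosa, and Versicolor.

Here is what can be read from the graph:

Virginica is characterized by large Petal.Length and Petal.Width
Setosa is characterized by small Petal.Length, Petal.Width, and Sepal.Length. It is also moderately associated with a large Sepal.Width (φ ≈ 0.58)
Versicolor is characterized by mid-range Petal.Length and Petal.Width, neither large nor small
Petal.Length and Petal.Width are positively correlated

and so on.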
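
These readings can be cross-checked against a plain contingency table. The following is a rough base R check for Petal.Length only, reusing the same tertile binning; `petal_length_cat` and `tab` are names introduced here for illustration.

```r
# Tertile binning of Petal.Length, as in the data-creation step
petal_length_cat <- cut(
  iris$Petal.Length,
  breaks = quantile(iris$Petal.Length, probs = seq(0, 1, length.out = 4)),
  labels = c("low", "mid", "high"),
  include.lowest = TRUE
)

# Counts of each species in each Petal.Length tertile
tab <- table(species = iris$Species, petal_length = petal_length_cat)
tab

# Row-wise proportions make the per-species pattern easier to read
prop.table(tab, margin = 1)

# Correlation of the raw continuous values, for comparison with the last point
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
petal_cor
```

A table like this shows the same structure as the graph one variable at a time; the graph is mainly useful because it shows all pairs at once.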

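Summary

Even for continuous variables, binning them into rough categories and visualizing the associations gives a reasonable overview of how the variables relate. Between continuous variables this is just an inferior version of a pair plot, but the ability to include categorical data in the same picture is a plus.
Also, widyr can compute (pointwise) mutual information as an association measure via pairwise_pmi, so visualizing associations with that measure might be interesting too.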
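
For reference, mutual information between two categorical variables can be sketched in base R as below. This is not widyr's implementation: it computes ordinary mutual information between whole variables (in nats) rather than a per-category-pair score, and `mutual_information` is a hypothetical helper name.

```r
# Hypothetical helper: mutual information (in nats) of two categorical vectors.
mutual_information <- function(x, y) {
  counts <- table(x, y)
  joint <- counts / sum(counts)                         # p(x, y)
  expected <- outer(rowSums(joint), colSums(joint))     # p(x) * p(y)
  nonzero <- joint > 0                                  # skip empty cells: 0 * log(0) is taken as 0
  sum(joint[nonzero] * log(joint[nonzero] / expected[nonzero]))
}

# Tertile binning of Petal.Length, as in the data-creation step
petal_length_cat <- cut(
  iris$Petal.Length,
  breaks = quantile(iris$Petal.Length, probs = seq(0, 1, length.out = 4)),
  labels = c("low", "mid", "high"),
  include.lowest = TRUE
)

# Association between species and the binned petal length
mi_species_petal <- mutual_information(iris$Species, petal_length_cat)

# Sanity check: a variable shares all of its entropy with itself.
# For three equally likely species this equals log(3).
mi_self <- mutual_information(iris$Species, iris$Species)
```

Unlike the φ coefficient, mutual information is never negative, so it cannot distinguish positive from negative association; a graph based on it would need a different edge-colouring rule.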

References

Rによるテキストマイニング (Text Mining with R), Chapter 4


Posted on Friday, January 24, 2020
