2020年11月のブログ記事一覧-裏 RjpWiki

を書いたが，その後，描いてみたら，日本語が使えなくなっていた。

その間に何をやったかというと，「Mac OS をバージョンアップして，macOS バージョン 10.14 Mojave にした」ことが原因か。

/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/matplotlib/font_manager.py:1238: UserWarning: findfont: Font family ['IPAMincho'] not found. Falling back to DejaVu Sans.
(prop.get_family(), self.defaultFamily[fontext]))

というのが出る。

matplotlibrc の IPAMincho を IPAGothic に変えても，相変わらず ['IPAMincho'] not found と言い張るので，1238 行でエラーが起きたという font_manager.py をみてみた。

しばらく，あちこち見たけど，初心に返り先頭にあるコメントを読んでみた。

Experimental support is included for using `fontconfig` on Unix
variant platforms (Linux, OS X, Solaris). To enable it, set the
constant ``USE_FONTCONFIG`` in this file to ``True``.

と書いてあったので，そうしたら，また，ちゃんと日本語が使えるようになった。

頑張ってます。クリックお願いします。

Fisher's exact test(Python)

2020年11月29日 | Python

以前にどこかにも書いたけど，

$ pip install FisherExact
で
>>> from FisherExact import fisher_exact
>>> fisher_exact([[1,3],[4,2]])
はできるんだけど。2 x 2 より大きい分割表を指定すると segmentation fault で落ちる

以下のようにすれば，R を呼んで，結果は出る

$ pip install rpy2

>>> import numpy as np
>>> import rpy2.robjects.numpy2ri
>>> from rpy2.robjects.packages import importr
>>> rpy2.robjects.numpy2ri.activate()

>>> stats = importr('stats')
>>> m = np.array([[4,4],[4,5],[10,6]]) # 対象とする分割表の定義
>>> res = stats.fisher_test(m) # fisher.test 起動
>>> print('p-value: {}'.format(res[0][0]))

やっぱ，Python は，ヤダナ

頑張ってます。クリックお願いします。

R と Python での分布関数について

2020年11月29日 | ブログラミング

そのたんびに確認しないといけないので，図にしておこう

頑張ってます。クリックお願いします。

Pweave と Sweave

2020年11月28日 | Python

TeXShop で，Pweave.engine と Sweave.engine を使い分けていたけど，やはり面倒

自動的にいずれか適切な方を起動するというようにしたほうが何かと便利

Pnw という拡張子を認識しないと愚痴ったけど，むしろそれでよかった

Sweave.engine の最初の方にちょっと書き足した

*.Rnw の先頭行が "%Pweave.Pnw であれば，pweave する。

ms = function(file, makeindex=FALSE, silent=FALSE, deletePdfs=FALSE, deleteWorkfiles=FALSE, ...) {
Sys.setlocale("LC_ALL", "ja_JP.UTF-8")
if (grepl("\\.", file) == FALSE) {
    file = paste(file, "Rnw", sep=".")
}
cat("Input file:", file, "\n")
con = file(file, open="r", encoding="utf-8")
a = readLines(con, 1)
close(con)
if (a == "%Pweave.Pnw") {
    system(sprintf("pweave -f tex %s", file))
} else {
   Sweave(file, encoding="utf-8")
}
base = sub(".(R|S)nw", "", file)

頑張ってます。クリックお願いします。

一元配置分散分析のパワーアナリシスを巡って

2020年11月28日 | ブログラミング

この記事の最後に掲載するシミュレーションプログラムを作ってみた。

プログラムは，群の平均値をベクトル，各群共通の標準偏差を引数に持つ。
power.anova.test() または pwr.anova.test() で所要の power を得るためのサンプルサイズ n を求める。
シミュレーションデータを生成し，一元配置分散分析により p.value < sig.level となる確率を集計する。
この確率が，power のシミュレーション値である。

実行例

***** シミュレーションのあらまし
群の数 k = groups = 3
各群の平均値 = 145, 152, 163
標準偏差は各群共通 sd = 10

***** power.anova.test() で計算してみる
between.var は平均値の不偏分散 bv = var(group.means) = 82.33333333
within.var は各群共通の標準偏差の二乗（つまり分散） sd^2 = 100
power.anova.test(groups=k, between.var=bv, within.var=sd^2, sig.level=sig.level, power=power)

Balanced one-way analysis of variance power calculation

groups = 3
n = 6.953779905
between.var = 82.33333333
within.var = 100
sig.level = 0.05
power = 0.8

NOTE: n is number in each group

power >= 0.8 となるのは n = 7 (計算された n = 6.953779905 )

***** pwr.anova.test() でも計算してみる
pwr.anova.test() で使われる効果量 f = sqrt(bv * (k-1)/k) / sd = 0.740870359
pwr.anova.test(k=k, f=f, sig.level=sig.level, power=power)

Balanced one-way analysis of variance power calculation

k = 3
n = 6.953782438
f = 0.740870359
sig.level = 0.05
power = 0.8

NOTE: n is number in each group

同じ n が得られる。

***** 1000 回のシミュレーション
各群の平均値が 145, 152, 163 で，各群共通の標準偏差が 10 である 3 群のシミュレーションデータ生成と分析
効果量 f の平均値 = 0.8039049106　　理論値は上述の f = 0.740870359
検出力 power = 0.787　　理論値は 0.8

***** ちなみに，power.anova.test() での within.var は，シミュレーションの 1 試行例では，
各群の不偏分散 = 188.795983946315, 131.296455655148, 27.5106187727577
各群の不偏分散の平均 = mean(group.vars) = 115.8676861 ==> within.var

***** anova(aov()) の結果例と照合
Reiduals の行の Mean Sq の数値と一致
Analysis of Variance Table

Response: x
Df Sum Sq Mean Sq F value Pr(>F)
g 2 2832.0374 1416.01871 12.221 0.00044391
Residuals 18 2085.6184 115.86769

プログラムソース

sim = function(n, group.means, sd = 1, sig.level = 0.05, power = 0.8, trial = 1000, plot.flag = TRUE) {
k = length(group.means)
cat("***** シミュレーションのあらまし\n")
cat("群の数 k = groups =", k, "\n")
cat("各群の平均値 =", paste(group.means, collapse = ", "), "\n")
cat("標準偏差は各群共通 sd =", sd, "\n")
if (plot.flag) {
layout(matrix(1:2, 2))
old = par(mgp = c(1.8, 0.8, 0), mar = c(3, 3, 2, 1))
x = seq(min(group.means) - 3 * sd, max(group.means) + 3 * sd, length = 200)
range.x = range(x)
range.y = range(dnorm(x, mean = group.means[1], sd = sd))
plot(0, 0, xlim = range.x, ylim = range.y, type = "n", xlab = "", ylab = "密度")
abline(v = group.means, xpd = TRUE, col = "gray80", lty = 3)
title("データの分布")
text(range.x[2], mean(range.y), paste("標準偏差 =", sd), pos = 2)
for (i in 1:k) {
y = dnorm(x, mean = group.means[i], sd = sd)
lines(x, y, col = i)
}
}
cat("***** power.anova.test() で計算してみる\n")
bv = var(group.means)
cat("between.var は平均値の不偏分散 bv = var(group.means) =", bv, "\n")
cat(" within.var は各群共通の標準偏差の二乗（つまり分散） sd^2 =", sd^2, "\n")
cat("power.anova.test(groups=k, between.var=bv, within.var=sd^2, sig.level=sig.level, power=power)\n")
pat = power.anova.test(groups = k, between.var = bv, within.var = sd^2,
sig.level = sig.level, power = power)
print(pat)
n = ceiling(pat$n)
cat("power >= ", power, "となるのは n =", n, "(計算された n =", pat$n, ")\n")
cat("***** pwr.anova.test() でも計算してみる\n")
f = sqrt(bv * (k - 1)/k)/sd
cat("pwr.anova.test() で使われる効果量 f = sqrt(bv * (k-1)/k) / sd =", f, "\n")
cat("pwr.anova.test(k=k, f=f, sig.level=sig.level, power=power)\n")
print(pwr.anova.test(k = k, f = f, sig.level = sig.level, power = power))
if (plot.flag) {
x = seq(min(group.means) - 3 * sd/sqrt(n),
max(group.means) + 3 * sd/sqrt(n), length = 200)
range.y = range(dnorm(x, mean = group.means[1], sd = sd/sqrt(n)))
plot(0, 0, xlim = range.x, ylim = range.y, type = "n", xlab = "", ylab = "密度")
abline(v = group.means, xpd = TRUE, col = "gray80", lty = 3)
title("標本平均の分布")
text(range.x[2], mean(range.y), paste("標準誤差 =", round(sd/sqrt(n), 4)), pos = 2)
for (i in 1:k) {
y = dnorm(x, mean = group.means[i], sd = sd/sqrt(n))
lines(x, y, col = i)
}
layout(1)
par(old)
}
g = factor(rep(1:k, n))
p.value = numeric(trial)
f = numeric(trial)
cat("*****", trial, "回のシミュレーション\n")
cat("各群の平均値が", paste(group.means, collapse = ", "), "で，各群共通の標準偏差が",
sd, "である", k, "群のシミュレーションデータ生成と分析\n")
for (i in 1:trial) {
x = rnorm(k * n, mean = group.means, sd = sd)
p.value[i] = oneway.test(x ~ g, var.equal = TRUE)$p.value
MS = anova(aov(x ~ g))$"Mean Sq"
f[i] = sqrt(MS[1]/n * (k - 1)/k/MS[2])
}
cat("効果量 f の平均値 =", mean(f), "\n")
cat("検出力 power =", mean(p.value < sig.level), "\n")
cat("***** ちなみに，within.var は，シミュレーションの 1 試行では，\n")
group.vars = tapply(x, g, var)
wv = mean(group.vars)
cat("各群の不偏分散 =", paste(group.vars, collapse = ", "), "\n")
cat("各群の不偏分散の平均 = mean(group.vars) =", wv, " ==> within.var\n")
cat("***** anova(aov()) の結果例と照合\n")
cat("Reiduals の行の Mean Sq の数値と一致\n")
print(anova(aov(x ~ g)))
}

library(pwr)
options(digits = 10)

sim(group.means = c(145, 152, 163), sd = 10, trial = 1000, plot = TRUE)

頑張ってます。クリックお願いします。

pwr.anova.test() と power.anova.test()

2020年11月27日 | ブログラミング

pwr.anova.test() と power.anova.test() 両方あって，面倒くさい。同じ条件で計算したとき，どちらも同じ答えにならないと困るので，調べてみた。

分散分析のサンプルサイズ計算―pwr.anova.test()を使う方法でも取り上げられていたのだけど，結論がちょっと違ったので，まとめておく。

まず，両方のプログラムで p.body で定義される関数本体の違いを見る。

power.anova.test で群の個数を表す変数が groups なので，pwr.anova.test と揃えるために k に置換して両者を以下に示す。

power.anova.test
lambda <- (k - 1) * n * (between.var/within.var)
pf(qf(sig.level, k - 1, (n - 1) * k, lower.tail = FALSE),
k - 1, (n - 1) * k, lambda, lower.tail = FALSE)

pwr.anova.test
lambda <- k * n * f^2
pf(qf(sig.level, k - 1, (n - 1) * k, lower = FALSE),
k - 1, (n - 1) * k, lambda, lower = FALSE)

両者は lamda の計算式が異なる。しかし，power.anova.test で between.var と within.var を使おうが，pwr.anova.test で between.var と within.var から計算される f を使おうが，同じ結果になるためには between.var と within.var と f の関係式を調べればよい。

つまり，(k - 1) * n * (between.var/within.var) = k * n * f^2 ならば，

f = sqrt((between.var * (k-1)/k) / within.var)

さて，between.var とは何者かということもあるが，これは

各群の平均値を groupemean としたとき，between.var = var(groupmean) で，一元配置分散分析表の群間の平均平方 Mean Sq を（各群で同じ）サンプルサイズ n で割ったものに等しい。（最後に書くけど，var() がくせ者）

前出の，分散分析のサンプルサイズ計算―pwr.anova.test()を使う方法では，

群間の平均平方 Mean Sq 484.7, n = 4, 群内の平均平方 35.9 として，効果量 f=sqrt(484.7/4)/sqrt(35.9) とした計算結果が示されている。

> pwr.anova.test(k=3, f=sqrt(484.7/4)/sqrt(35.9), power=0.8)

Balanced one-way analysis of variance power calculation

k = 3
n = 2.289809 # 一番最後の結果に示した power.anova(... between.var=484.7/4 * (3/2), ...) の結果と同じ
f = 1.837212
sig.level = 0.05
power = 0.8

NOTE: n is number in each group

これを power.anova.test() でやってみると，結果が異なる。

> power.anova.test(groups=3, between.var=484.7/4, within.var=35.9, power=0.8)

Balanced one-way analysis of variance power calculation

groups = 3
n = 2.712944 # 2.289809 と合わない
between.var = 121.175
within.var = 35.9
sig.level = 0.05
power = 0.8

NOTE: n is number in each group

そこで，上に示したように効果量 f の計算に使う between.var を修正して分析する。

> pwr.anova.test(k=3, f=sqrt(484.7/4 * 2/3)/sqrt(35.9), power=0.8)

Balanced one-way analysis of variance power calculation

k = 3
n = 2.712935 # 誤差範囲で 2.712944 と同じになった
f = 1.500077
sig.level = 0.05
power = 0.8

NOTE: n is number in each group

この比較は power.anova.test を基準にしたのだが，pwr.anova.test を基準にしても構わない。

> power.anova.test(groups=3, between.var=484.7/4 * (3/2), within.var=35.9, power=0.8)

Balanced one-way analysis of variance power calculation

groups = 3
n = 2.289779 # 最初に示した pwr.anova.test の結果と一致する
between.var = 181.7625
within.var = 35.9
sig.level = 0.05
power = 0.8

NOTE: n is number in each group

いずれを基準としてもよいのだけど，別々では困る。

between.var = var(groupmean) の自由度は R では k - 1 だから，これに基づいて計算する場合は (k-1)/k の修正が必要かなあ？と思う次第。つまり，power.anova.test() を基準とする。

頑張ってます。クリックお願いします。

import のやり方

2020年11月27日 | Python

ディレクトリ A に __init__.py と B.py という Python プログラムファイルがあり，その中に C，D という関数が定義されている。

__init__.py があるディレクトリは，Python でいうところの「パッケージ」である。

ディレクトリ A
　__init__.py
　ファイル B.py
　　関数 C
　　関数 D

C，D をどのようにして呼ぶか。

1. まずディレクトリ A へ移動してから

1.1. コマンドラインで前もって A へ移動し，python を起動した場合

>>> import B
>>> B.C(1,2)
これは，関数 C の出力です 3
>>> B.D(2,3)
これは，関数 D の出力です 6

>>> from B import C, D
>>> C(9,3)
これは，関数 C の出力です 12
>>> D(9,3)
これは，関数 D の出力です 27
>>>

1.2. ディレクトリ A 以外から python を起動した場合

>>> import os
>>> os.chdir("A のパス指定")
後は同じ

2. どこからも使えるようにする

2.1. path を設定する

>>> import os
>>> import sys
>>> sys.path.append("/Users/foo/Desktop/A/")

>>> import B

>>> B.C(1,3)
これは，関数 C の出力です 4

>>> B.D(4,5)
これは，関数 D の出力です 20

>>> from B import C, D

>>> B.C(10, 20)
これは，関数 C の出力です 30

>>> B.D(11, 34)
これは，関数 D の出力です 374

2.2. 環境変数 PYTHONPATH を設定する

tcsh なら

setenv PYTHONPATH ${PYTHONPATH}:/Users/なんとかかんとか/A/

bash なら

export PYTHONPATH=${PYTHONPATH}:/Users/なんとかかんとか/A/

頑張ってます。クリックお願いします。

pweave で R の xtable 相当の機能を装備する（その2, Python）

2020年11月25日 | Python

まったく。

日本語(utf-8)が使えないのは，Pweave.engine で経験・解決済みだろ。

ということで，xtable 関数に原因があるわけではないので，それを呼び出す前に locale をセットする必要があるということ。

* 使用法:

setlocale は，xtable の定義の後に実行されているので，普通は再度設定する必要はない。

xtable の引数は，
   xtable(x, caption, label, fmt, align, pos="htbp")

x   pandas のデータフレームオブジェクト caption 表のキャプション
label   table 環境につけるラベル。
   label+".tex" という名前でファイルが作られるので，必要なところで \input{*} で読み込む。
fmt   表の各列の形式を指定するリスト
   文字列なら "s"，数値なら小数部の桁数 "3" など，整数なら "0" または "d" を指定する。
   表の列あたり 1 文字ずつ，全部で表の列数ぶんの長さの文字列で指定(例 "s053"
align   各列での配置法を表す 1 つの文字列(長さは表の列数に等しい) 例 "rllcr"
pos   表の位置。デフォルトは "htbp"

* 使用例

<<>>=
import pandas as pd x = pd.DataFrame({
"a" : [1,2,3,4,5],
"b" : [1.1,2.2,3.3,4.5,5.5], "c" : ["a", "b", "c", "d", "e"] })
xtable(x, "xtable 使用例", "label", "d1s", "rrr")
@

\input{labe.tex}

* xtable

<<>>=
def xtable(x, caption, label, fmt, align, pos="htbp"):
    import scipy as sp
    colnames = x.columns
    f = open(label+".tex", "w")
    x = sp.array(x)
    n = x.shape
    print("\\begin{{table}}[{0:s}]". format(pos), file=f)
    print("\\begin{center}", file=f)
    print("\\caption{{{0:s}}}". format(caption), file=f)
    print("\\label{{{0:s}}}". format(label), file=f)
    print("\\begin{{tabular}}{{{0:s}}} \\hline". format(align), file=f)
    for i in range(n[0]):
        if i == 0:
            for j in range(n[1]):
                print("{0:s}". format(colnames[j]), end="", file=f)
                if j == n[1]-1:
                    print(" \\\\ \\hline", file=f)
                else:
                    print(" & ", end="", file=f)
        for j in range(n[1]):
            if fmt[j] == "s":
                print("{0:s}". format(x[i,j]), end="", file=f)
            elif fmt[j] == "d":
                print("{0:d}". format(int(x[i,j])), end="", file=f)
            else:
                digits = int(fmt[j])
                print("{1:.{0:d}f}". format(digits, float(x[i,j])), end="", file=f)
            if j == n[1]-1:
                print(" \\\\", end="", file=f)
            else:
                print(" & ", end="", file=f)
        if i == n[0]-1:
            print(" \\hline", file=f)
        else:
            print(file=f)
    print("\\end{tabular}", file=f)
    print("\\end{center}", file=f)
    print("\\end{table}", file=f)
    f.close()

import locale
locale.setlocale(locale.LC_ALL, "ja_JP.UTF-8")
@

頑張ってます。クリックお願いします。

matplotlib で日本語を使えるようにする(Mac, Python)

2020年11月25日 | Python

matplotlib で日本語を使えるようにする

Mac の場合，デフォルトで使える日本語フォントは AppleGothic である。まずは，これを使ってみる。

matplotlib のさまざまな設定は，matplotlibrc ファイルに書いてある。このファイルで，使うフォントの設定を日本語フォントを使うように書き換えればよい。

matplotlibrc は，Python で以下のようにすれば分かる。

>>>> import matplotlib as mpl
>>>> mpl.matplotlib_fname()

表示されたパスが /usr/local/Cellar/python3/途中省略/matplotlib/mpl-data/matplotlibrc のようなものだとすれば，それは，Python がインストールされたときに自動的に作られたものである。まずこれを自分のホームディレクトリへコピーする。

コンソール(ターミナル)で
$ cp /usr/local/Cellar/python3/途中省略/matplotlib/mpl-data/matplotlibrc ̃/.matplotlib/
とすれば，自分のホームディレクトリの .matplotlib の下に matplotlibrc がコピーされる。

この時点で再度，

>>>> import matplotlib as mpl
>>>> mpl.matplotlib_fname()

としたとき，/Users/foo/.matplotlib/matplotlibrc のようになることを確認しよう。

コピーされた matplotlibrc をエディタで開き，以下のように font.serif，font.sans-serif を含む行の先頭の # を削除して，フォントリストの先頭に AppleGothic を加える。

その他の設定も自分好みに変更してもよい。

旧
#font.serif : DejaVu Serif, Bitstream Vera Serif, 以下略
#font.sans-serif : DejaVu Sans, Bitstream Vera Sans, 以下略

新
font.serif : AppleGothic, DejaVu Serif, Bitstream Vera Serif, 以下略
font.sans-serif : AppleGothic, DejaVu Sans, Bitstream Vera Sans, 以下略

設定が終わったら，以下のプログラムを実行してみる。

>>>> df = pd.DataFrame(np.array([[1,2,3,4,5,6], [3,6,7,4,8,9]]).T, columns = ['月','製品A'])
>>>> df.plot(kind = 'line', x = '月', y = '製品A').set(ylabel = '売上')

もしフォントが気に入らなければ(注1），「IPA フォントダウンロードページ」〔注2）から，TTF ファイルの「4 書体パック (Ver.003.03) IPAfont00303.zip」をダウンロード・展開する。

Mac の場合，展開された 4 ファイルを，ライブラリ/Fonts ディレクトリへ移動する。

4 つの書体の名前，IPAPGothic，IPAGothic，IPAPMincho，IPAMincho から，いずれかを選んで，上述の matplotlibrc に設定する〔注3）。

======================================================================

（注1）一部の漢字が用意されていない?例えば「売」はトーフになる。
（注2）http://ipafont.ipa.go.jp/old/ipafont/download.html
（注3）上の図のフォントは IPAMincho である。

頑張ってます。クリックお願いします。

pweave で R の xtable 相当の機能を装備する(Python)

2020年11月25日 | Python

pweave のチャンクではまだ results=tex がサポートされていない，まあ何とかすることはできるだろうと思ってやってみた

ただし，関数で *.tex ファイルを保存して \input{*} で読み込もうという手はず。

<<>>=
def xtable(x, caption, label, fmt, align, pos="htbp"):
    import scipy as sp
    colnames = x.columns
    f = open("test.tex", "w")
    x = sp.array(x)
    n = x.shape
    print("\\begin{{table}}[{0:s}]". format(pos), file=f)
    print("\\begin{center}", file=f)
    print("\\caption{{{0:s}}}". format(caption), file=f)
    print("\\label{{{0:s}}}". format(label), file=f)
    print("\\begin{{tabular}}{{{0:s}}} \\hline". format(align), file=f)
    for i in range(n[0]):
        if i == 0:
            for j in range(n[1]):
                print("{0:s}". format(colnames[j]), end="", file=f)
                if j == n[1]-1:
                    print(" \\\\ \\hline", file=f)
                else:
                    print(" & ", end="", file=f)
        for j in range(n[1]):
            if fmt[j] == "s":
                print("{0:s}". format(x[i,j]), end="", file=f)
            elif fmt[j] == "d":
                print("{0:d}". format(int(x[i,j])), end="", file=f)
            else:
                digits = int(fmt[j])
                print("{1:.{0:d}f}". format(digits, float(x[i,j])), end="", file=f)
            if j == n[1]-1:
                print(" \\\\", end="", file=f)
            else:
                print(" & ", end="", file=f)
        if i == n[0]-1:
            print(" \\hline", file=f)
        else:
            print(file=f)
    print("\\end{tabular}", file=f)
    print("\\end{center}", file=f)
    print("\\end{table}", file=f)
    f.close()
@

という関数を定義しておいて，LaTeX で表示したいものを DataFrame にして，変換関数 xtable に渡し，ファイルに書いた LaTeX 形式のファイルを \input してタイプセットということで，

<<>>=
import pandas as pd
x = pd.DataFrame({
    "a" : [1,2,3,4,5],
    "b" : [1.1,2.2,3.3,4.5,5.5],
    "c" : ["a", "b", "c", "d", "e"]
    })

xtable(x, 'title', "label", ["3", "0", "s"], "rrr")
@
\input{test}

しかし，ここに落とし穴が。

title を日本語で指定すると，だめになる。

Python で utf-8 エンコーディングの日本語をファイルに書いてからそれを読む方法がわからん。

できないわけはないと思うが，できないなら Python の大チョンボだ。

どんなアプリでもプログラム言語でも，utf-8 文字列をファイルに書いて，それをどんなアプリでもプログラム言語でも，読み込んだらちゃんと utf-8 文字列になるというのは当たり前と思うが？？？

解決法を知っている方はお教えいただきたい。

#########################################

プログラムの先頭で，

import locale
locale.setlocale(locale.LC_ALL, "ja_JP.UTF-8")

のようなおまじないを唱えておくとよいようだ。

頑張ってます。クリックお願いします。

Python で直線回帰

2020年11月24日 | Python

Qiita なんかで，数週間おきに，直線回帰についての記事が上がったりする。

いずれも sklearn だっけ？を使うか，あるいはそれこそ各変数の平均値を求める段階から自分でプログラムを書くかという両極端。後者は，そこまでやらなくてもという，前者はたかが直線回帰でそんなにたくさん記述が必要かという。いずれも，残念感満載の記事。それが何回も何回も出てくる。もう，うんざりだ。

Python に用意されているほどほどの機能を使って，端的に直線回帰を行う関数を書いてみる。

scipy.stats の linregress や pearsonr は使わないことにして（だって，それ使ったらむっちゃ簡単ですやん。Qiita 成り立たへん）

import numpy as np

def lm(x, y):
vx, vxy, _, vy = np.ravel(np.cov(x, y))
b = vxy / vx
return np.mean(y) - b * np.mean(x), b, vxy / np.sqrt(vx * vy)

独立変数のベクトル x と，従属変数のベクトル y を与えれば，回帰直線の切片と傾きと，ついでに変数間の相関係数を返す。必要最小限の機能だ。

引数は，numpy ndarray でも，リストでも，pandas データフレームから取り出した pandas.Series でもよい。

>>> def lm(x, y):
... vx, vxy, _, vy = np.ravel(np.cov(x, y))
... b = vxy / vx
... return np.mean(y) - b * np.mean(x), b, vxy / np.sqrt(vx * vy)

>>> lm(range(5), [3, 2, 6, 10, 15])
(0.7999999999999998, 3.2, 0.9444501377615284)

>>> import pandas as pd
>>> df = pd.DataFrame({"x": range(5), "y": [3, 2, 6, 10, 15]})
>>> lm(df.x, df.y)
(0.7999999999999998, 3.2, 0.9444501377615284)

一番，オーソドックスには

>>> x = np.array(range(5))
>>> y = np.array([3, 2, 6, 10, 15])
>>> intercept, slope, r = lm(x, y)
(0.7999999999999998, 3.2, 0.9444501377615284)

重相関係数が欲しいなら

>>> r**2
0.89198606271777

予測値が欲しいなら

>>> intercept + slope * x
array([ 0.8, 4. , 7.2, 10.4, 13.6])

散布図と回帰直線が欲しいなら，

import matplotlib.pyplot as plt
plt.scatter(x, y)
x2 = np.array([min(x), max(x)])
y2 = intercept + slope * x2
plt.plot(x2, y2)
plt.show()

他に，何が欲しい？

頑張ってます。クリックお願いします。

π を明示的には使わない，ビュフォンの針のシミュレーション(Python)

2020年11月23日 | Python

ビュフォンの針のシミュレーションプログラムはたくさん書かれているが，プログラム内で定数 π を使っているものがほとんどである。

π を使わないシミュレーションプログラムを書いたので記録に残しておく。

方針は，原点を左下の頂点とする一辺 1 の正方形内のランダムな点の座標（x2，y2）のうち，原点からの距離が 1 以下の点について sin(theta) は三角関数の定義により y2/sqrt(x2**2 + y2**2) であることを使う。

以下のプログラムで DEBUG = True として実行すると，theta の分布は一様分布になっていることが確認できる。

なお，プログラム中の REMARK と書かれた if 文を外すと，半径が 1 より大きい点も採用されてしまうので，theta は一様分布にはならない。

import random
import math

DEBUG = False
if DEBUG:
import numpy as np
import matplotlib.pyplot as plt

n = 100000
d = 10
l = d / 2
i = 0
cross = 0
if DEBUG: theta = np.zeros(n)
while i < n:
y = random.uniform(0, d / 2) # 針の中心から最も近い平行線までの距離
# 0 <= theta <= π/2 の一様分布を求める
x2 = random.uniform(0, 1) # 一辺 1 の正方形内の座標 (x2, y2)
y2 = random.uniform(0, 1)
r2 = x2**2 + y2**2 # 原点から (x2, y2) までの平方距離
if r2 < 1: ## REMARK　円の内部の点に限定する
r = math.sqrt(r2) # 原点から (x2, y2) までの距離
sine_theta = y2 / r # sine の定義により
if DEBUG: theta[i] = math.asin(sine_theta)
i += 1
if y <= l / 2 * sine_theta: # 平行線と交わる
cross += 1

if DEBUG: plt.hist(theta, bins=np.arange(0, np.pi/2, np.pi/60))
print(2 * l * n / (cross * d))

頑張ってます。クリックお願いします。

ややこしいプログラム -- R の eval() と quote()

2020年11月23日 | ブログラミング

R の eval() は，Python の eval() とは全く異なるものである。ということを書きたい。

以下の例示プログラムを使って説明しよう。

1: func = function(x = NA, y = NA) {
2: z = function(f0) {
3: return(f0(10))
4: }
5: f.body = quote(2 * x + 3 * y + 5)
6: if (is.na(x)) {
7: x = z(function(x) eval(f.body))
8: print(x)
9: } else if (is.na(y)) {
10: y = z(function(y) eval(f.body))
11: print(y)
12: }
13: }

このプログラムは

func(x = 1)
func(y = 2)

のように利用される。

func(x = 1) の場合，y = NA なので，9 行目の条件判定が成立し，10 行目の y = z(function(y) eval(f.body)) が実行される。
eval(f.body) は，5 行目で定義されている f.body の右辺の quote のなかみと置き換えられる。つまり，y = z(function(y) 2 * x + 3 * y + 5) と書かれるのと同じ効果を生む。
z は 2〜4 行目で定義されている関数で，引数は一つだけで，その引数は別の関数である。
10 行目は，無名関数 function(y) 2 * x + 3 * y + 5 を引数として z を呼ぶことになる。この無名関数の変数は y であり，x は呼ばれたときに x が持っている定数である。
関数 z は仮引数の関数 f0 の変数に 10 を代入した結果を返す（実際の z はもっと複雑なことをするのだろうが）。
f0(y) = 2 * x + 3 * y + 5 で，f0(10) を求めるということである。x の値は 1 なので，f0(10) = 2 * 1 + 3*10 + 5 = 37 が表示される。

func(y = 1) の場合，x = NA なので，6 行目の条件判定が成立し，7 行目の x = z(function(x) eval(f.body)) が実行される。
これは前述の場合と同じように，x = z(function(x) 2 * x + 3 * y + 5) と書かれるのと同じで，無名関数 function(x) 2 * x + 3 * y + 5 を引数として z を呼ぶ。この無名関数の変数は x であり，y は定数である。
関数 z は f0(x) = 2 * x + 3 * y + 5 で，y の値は 2 であり，f0(10) = 2 * 10 + 3 * 2 + 5 = 31 が返される。

つまり，z に渡される関数の本体はいずれの場合にも 2 * x + 3 * y + 5 であるが，本体中に現れる x, y のどちらが変数であるか，function(x) なのか function(y) なのかが異なる。関数本体が 5 行目で表されるような簡単なものならば，7, 10 行目に直接書き込んでも良いのだが，長くて複雑な場合や，関数本体に複数の変数を含むような場合にはそれぞれに対応して 8，10 行目以外にも複数箇所で同じ関数本体を重複して書かねばならないことになる。それを避けるための手段が eval() と quote() を使うということである。

さて，長くなったが，これを Python で書くと

1: import numpy as np
2: from numpy import nan as NA
3:
4: def func(x = NA, y = NA):
5: def z(f0):
6: return f0(10)
7: if np.isnan(x):
8: def f_x(x):
9: return 2 * x + 3 * y + 5
10: x = z(f_x)
11: print(x)
12: elif np.isnan(y):
13: def f_y(y):
14: return 2 * x + 3 * y + 5
15: y = z(f_y)
16: print(y)

となり，

func(x = 1)
func(y = 2)

で呼び出すと，R のプログラムと同じ働きをする。
R のような eval() と quote() の対がないので，関数本体が同じ（9, 14 行目）だが，どれが変数かを示すところが違う（8，13 行目）関数定義を2箇所に書かねばならない。
関数が短い場合は，以下のように lambda 記法で書けば行数が少なくて済むこともあるが，同じようなことを複数回書かねばならないことに違いはない。

if np.isnan(x):
x = z(lambda x: 2 * x + 3 * y + 5)
print(x)
elif np.isnan(y):
y = z(lambda y: 2 * x + 3 * y + 5)
print(y)

ところで，Python にも eval() はある。文字列で表されている式を評価するものである。
body = '2 * x + 3 * y + 5'
で，x = 10; y = 2 のとき，eval(body) は 2 * x + 3 * y + 5 に x, y を代入して，式を評価し，結果 37 を返す。
x = 1; y = 10 のとき，eval(body) は 2 * x + 3 * y + 5 に x, y を代入して，式を評価し，結果 31 を返す。
なんだ，同じようなことをやってるじゃないか？とお思いかもしれないが，以下のようなプログラムは動かない。

1: import numpy as np
2: from numpy import nan as NA
3:
4: def func(x = NA, y = NA):
5: def z(f0):
6: return f0(10)
7: body = "2 * x + 3 * y + 5"
8: if np.isnan(x):
9: x = z(lambda x: eval(body))
10: print(x)
11: elif np.isnan(y):
12: y = z(lambda y: eval(body))
13: print(y)

2 箇所の eval(body) は，x, y いずれかが nan なので，eval(body) は nan になる。

というように，R の eval() は Python の eval とは全くの別物ということである。

Python に R の eval() のようなものはないのだろうか？？

頑張ってます。クリックお願いします。

関数において，特定の値 NA を取る引数の個数

2020年11月22日 | ブログラミング

R でいえば

func = function(a = NA, b = NA, c = NA) {
print(sum(sapply(list(a, b, c), is.na)))
}

※注
print(sum(sapply(list(a, b, c), is.na)))
よりは
print(sum(is.na(c(a, b, c))))
のほうがよいが，以下の Python プログラムの場合と近づけるためにこのように書いた

Python でいえば

import numpy as np
NA = np.nan

def func(a = NA, b = NA, c = NA):
print(sum(np.isnan(arg) for arg in [a, b, c]))

のような関数定義があったとき，func が呼ばれたときの引数の値が NA であったものの個数をチェックするにはどうするかということ。

R では

func = function(a = NA, b = NA, c = NA) {
print(sum(sapply(list(a, b, c), is.na)))
}

func() # 3
func(a = 1) # 2
func(a = 3, c = 4) # 1
func(a = 1, b = 4, c = 9) # 0

Python では

import numpy as np
NA = np.nan

def func(a = NA, b = NA, c = NA):
print(sum(np.isnan(arg) for arg in [a, b, c]))

func() # 3
func(a = 1) # 2
func(a = 3, c = 4) # 1
func(a = 1, b = 4, c = 9) # 0

となる。

決して，以下のようにしてはならない。どの場合も 0 になってしまう。

def func(a = NA, b = NA, c = NA):
print(sum(arg == NA for arg in [a, b, c]))

func() # 0
func(a = 1) # 0
func(a = 3, c = 4) # 0
func(a = 1, b = 4, c = 9) # 0

頑張ってます。クリックお願いします。

あまりにも長い変数名はどうなのよ

2020年11月22日 | Python

Python のプログラム規範にあるのかしらないけど，「変数名は実態を表すものにしなさい」とかあったとする。

じゃあ，それにしたがって，

long_long2_long3_long4_long5 = 1

なんて書く馬鹿はいる？（いるみたいなんだなこれが）

long_long2_long3_long4_long8 = long_long2_long3_long4_long5 + 1

と

long_long2_long3_long4_long6 = long_long2_long3_long4_long5 + 1

を見分けろというのか（^_^;）

バグ生産の温床だね。

頑張ってます。クリックお願いします。

PVアクセスランキングにほんブログ村

プロフィール

フォロー中フォローするフォローする

バックナンバー

2025年02月

2025年01月

2024年12月

2024年11月

2024年10月

2024年09月

2024年08月

2024年07月

2024年06月

2024年05月

2024年04月

2024年03月

2024年02月

2024年01月

2023年12月

2023年11月

2023年10月

2023年09月

2023年08月

2023年07月

2023年06月

2023年05月

2023年04月

2023年03月

2023年02月

2023年01月

2022年12月

2022年11月

2022年10月

2022年09月

2022年08月

2022年07月

2022年06月

2022年05月

2022年04月

2022年03月

2022年02月

2022年01月

2021年12月

2021年11月

2021年10月

2021年09月

2021年08月

2021年07月

2021年06月

2021年05月

2021年04月

2021年03月

2021年02月

2021年01月

2020年12月

2020年11月

2020年10月

2020年09月

2020年08月

2020年07月

2020年06月

2020年05月

2020年04月

2020年03月

2020年02月

2020年01月

2019年12月

2019年11月

2019年09月

2019年08月

2019年07月

2019年06月

2019年04月

2018年12月

2018年09月

2018年08月

2018年07月

2018年06月

2018年05月

2018年03月

2018年02月

2018年01月

2017年12月

2017年11月

2017年10月

2017年09月

2017年08月

2017年07月

2017年06月

2017年05月

2017年04月

2017年03月

2017年02月

2017年01月

2016年12月

2016年11月

2016年10月

2016年09月

2016年08月

2016年07月

2016年06月

2016年05月

2016年04月

2016年03月

2016年02月

2016年01月

2015年12月

2015年11月

2015年10月

2015年09月

2015年08月

2015年07月

2015年06月

2015年05月

2015年04月

2015年03月

2015年02月

2015年01月

2014年12月

2014年11月

2014年10月

2014年09月

2014年08月

2014年07月

2014年06月

2014年05月

2014年04月

2014年03月

2014年02月

2014年01月

2013年12月

2013年11月

2013年10月

2013年09月

2013年08月

2013年07月

2013年06月

2013年05月

2013年04月

2013年03月

2013年02月

2013年01月

2012年12月

2012年11月

2012年10月

2012年09月

2012年08月

2012年07月

2012年06月

2012年05月

2012年04月

2012年03月

2012年02月

2012年01月

2011年12月

2011年11月

2011年10月

2011年09月

2011年07月

2011年06月

2011年05月

2011年04月

2011年03月

2011年02月

2011年01月

2010年09月

2010年08月

2010年07月

カレンダー

2020年11月
日	月	火	水	木	金	土
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

前月

次月

goo blog お知らせ

	ブログを読むだけ。毎月の訪問日数に応じてポイント進呈
	【コメント募集中】みそ汁のシジミは食べていますか？
	訪問者数に応じてdポイント最大1,000pt当たる！
	dポイントが当たる！無料『毎日くじ』