
Professor, thank you for taking the time on short notice. A member of my team recently asked me whether we could use reinforcement learning to improve our operations, and honestly I'm stuck. I can't see where or how to invest to actually get results. Could you walk me through it?

That's a great thing to be asking about! Don't worry, let's sort it out together. Taking a method called β-DQN as our example, I can explain it in plain terms from the standpoint of return on investment and field deployment.

First off, what is this β? Technical jargon always trips me up...

Another good question! Simply put, β turns "how we have acted so far" into numbers. Concretely, it is a function that estimates, for each situation, the probability with which each option has been chosen, so it makes the bias in past behavior visible.

I see. So what improves if we adopt it? What is the benefit that justifies the investment?

Good question. Let me group it into three points. First, β makes exploration more efficient, so good options that are still unknown become easier to find. Second, by actually trying options that have been overestimated, the model can correct its mistaken estimates. Third, it avoids forcing the use of options that barely appear in the real data, which leads to more stable learning.

So in plain terms: it stops us from just repeating the same biased choices, and it finds new, better approaches while correcting mistakes along the way?

Exactly! You've grasped it well. To add a little more: the method builds several policies on top of β and includes a mechanism that selects whichever policy is most effective at each moment, so operation stays both flexible and explainable.

On the operations side, I'm worried about resistance from the field. Won't this add both data-collection machinery and supervision workload?

A fair concern. In practical terms, β-DQN only adds a small component to an existing DQN (Deep Q-Network), so the deployment burden is nothing like building a new system from scratch. Supervision is needed at first, but because the method is explainable, the field should find it easier to accept.

One more word on return on investment, please. If a small experiment shows no effect, can we pull out?

You can. Run a small-scale A/B test, collect action logs, and if the β distribution stops shifting or the reward stops improving, you can halt early. Realistically, start with a limited site and a safety valve in place.
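To make that concrete, here is a minimal sketch of such an early-stopping check. The window size and improvement threshold are invented for illustration, not taken from the paper:

```python
def should_stop(reward_history, window=5, min_gain=0.01):
    """Early-stop rule for a pilot A/B test.

    Returns True when the rolling mean reward over the most recent
    `window` episodes has not improved by at least `min_gain` relative
    to the window before it. Both thresholds are illustrative.
    """
    if len(reward_history) < 2 * window:
        return False  # not enough data to judge yet
    recent = sum(reward_history[-window:]) / window
    previous = sum(reward_history[-2 * window:-window]) / window
    return (recent - previous) < min_gain
```

The same plateau check can be applied to the shift in the β distribution; whichever statistic flatlines first signals that it is time to decide.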

Understood. Let me restate my understanding at the end. In short: "β uses past action probabilities to steer exploration intelligently, validates overestimated choices to stabilize learning, deployment can be incremental, and the investment risk stays contained." Is that right?

That's a superb summary. If it helps, let me leave you with a briefing note and, at the end, some short phrases you can use in meetings.
1. Overview and positioning
To state the conclusion first: this research presents a practical method for making deep reinforcement learning (Deep Reinforcement Learning, DRL) usable in real-world operations by improving the exploration strategy concisely and efficiently. Concretely, it augments an existing Deep Q-Network (DQN) with a behavior function, β, that estimates the action distribution, allowing the balance between exploration and exploitation to be controlled deliberately and achieving both stable and efficient learning.
By way of background, reinforcement learning has high theoretical potential, but exploration (finding good actions that are still unknown) demands large amounts of computation and trial and error, which has made field deployment difficult. Conventional methods tend to be complex and task-specific, often lacking generality and ease of implementation. Against that backdrop, this method stands out for its simplicity: it can be added to an existing DQN implementation with minimal modification, making it a practitioner-friendly solution.
The crux of the method is to use β, the probability that each action was taken in a given state in the past, in the design of exploration. Specifically, β can be used both to deliberately try options whose per-action selection frequency is low and to mask out actions that barely exist in the data, preventing excessive generalization. This achieves two things at once: discovering promising unknown strategies and correcting erroneous overestimates.
The practical significance is clear. It reduces the risk of the model recommending overly optimistic policies early on, and by prioritizing policies backed by data, post-deployment behavior stays explainable and easy to monitor. The result is an AI operation that can be scaled up incrementally from small-scale trials.
This section has presented the concept and purpose; the remainder is organized in order as follows: differences from prior work, key technical elements, experimental results, discussion and limitations, and future directions.
2. Points of differentiation from prior work
Previous work on improving exploration falls into two streams. One raises success rates by adding complex exploration bonuses or model-based estimation; the other prioritizes computational efficiency by relying on simple techniques such as ε-greedy (epsilon-greedy). The former performs well but carries heavy tuning and compute costs; the latter is lightweight but explores less efficiently.
This research differentiates itself on three points. First, generality: estimating the action distribution β requires neither task-specific reward design nor elaborate exploration bonuses, so it transfers across a wide range of environments. Second, low computational cost: β is obtained simply by aggregating statistics from the replay memory (experience replay), adding almost no burden to an existing DQN implementation. Third, explainability: policy generation based on β makes it easy to trace why a given action was chosen, which helps build consensus in the field.
Some prior studies introduce multiple sophisticated mechanisms to raise exploration efficiency, but many of them only pay their way on specific hard-exploration tasks. For practitioners, the value of this method lies in delivering steady improvements across a broad set of tasks without over-specializing.
From an investment standpoint, therefore, the greatest attraction is that the effect can be tested with a small amount of additional development, and the differentiator is that it can be slotted into an A/B test without a large-scale algorithm overhaul.
3. Key technical elements
First, terminology. Deep Reinforcement Learning (DRL) is a family of methods that use neural networks to learn action values. The Deep Q-Network (DQN) that this research extends is the classic algorithm that estimates a value (Q-value) for each state-action pair and aims to maximize it. The core of the method is to attach to it a behavior function, β, that estimates action probabilities.
β is estimated from the actual action history accumulated in the replay memory, as the probability that each action was chosen in each state. A low β indicates that an action has rarely been tried in the past, so deliberately selecting such actions promotes exploration. Conversely, actions whose β is extremely low, i.e. that barely exist in the data, would destabilize learning, so they are excluded from Q-value lookups to keep training stable.
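The count-based reading of β above can be sketched as follows. The paper's β is typically a learned estimator; treating it as smoothed frequency counts over hashable (e.g. discretized) states is a simplifying assumption made here for illustration:

```python
from collections import defaultdict

class BetaEstimator:
    """Count-based sketch of the behavior function beta(s, a)."""

    def __init__(self, n_actions, prior=1.0):
        self.n_actions = n_actions
        self.prior = prior  # Laplace smoothing so unseen actions keep beta > 0
        self.counts = defaultdict(lambda: [0] * n_actions)

    def update(self, state, action):
        # Called once per transition appended to the replay memory.
        self.counts[state][action] += 1

    def beta(self, state):
        # Smoothed empirical distribution over actions for this state.
        c = self.counts[state]
        total = sum(c) + self.prior * self.n_actions
        return [(ci + self.prior) / total for ci in c]
```

Usage: after many `update("s0", 2)` calls, `beta("s0")` concentrates on action 2 while the rarely-tried actions retain small but nonzero probability, which is exactly the signal used both to target them for exploration and to mask them out of bootstrapping.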
In addition, multiple policies are generated on the basis of β, and a meta-controller (a higher-level controller) selects the optimal policy for each episode. This means the agent does not depend on a single fixed policy but can switch flexibly between exploration and exploitation as the situation demands; the design achieves both explainability and ease of implementation.
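One way the policy family plus meta-controller could be wired up is sketched below. The scoring rule `q + w * (1 - beta)` and the epsilon-greedy bandit update are hypothetical stand-ins for the paper's exact design, chosen only to show the shape of the mechanism:

```python
import random

def make_policy(w):
    """Policy scoring actions by Q-value plus an exploration bonus.

    w interpolates from pure exploitation (w = 0) to beta-driven
    exploration (large w favors rarely-tried, low-beta actions).
    Hypothetical scoring rule for illustration.
    """
    def act(q_values, beta):
        scores = [q + w * (1.0 - b) for q, b in zip(q_values, beta)]
        return max(range(len(scores)), key=scores.__getitem__)
    return act

class MetaController:
    """Epsilon-greedy bandit over candidate policies, scored by the
    episode returns each policy achieved when it was last used."""

    def __init__(self, weights, eps=0.1):
        self.policies = [make_policy(w) for w in weights]
        self.returns = [0.0] * len(weights)
        self.eps = eps

    def select(self):
        if random.random() < self.eps:
            return random.randrange(len(self.policies))
        return max(range(len(self.returns)), key=self.returns.__getitem__)

    def report(self, idx, episode_return):
        # Exponential moving average of the observed episode return.
        self.returns[idx] = 0.8 * self.returns[idx] + 0.2 * episode_return
```

The per-episode granularity matters: the controller commits to one policy for a whole episode, observes its return, and updates its preference, which keeps the switching behavior easy to log and explain.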
Mathematically, the Q-update is written so that it bootstraps only over actions supported by β, which contributes to correcting overestimation bias. In short, β functions as a filter that makes explicit how far the training data can be trusted.
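The masked bootstrap can be written compactly. The fixed cutoff β ≥ threshold used here is one simple way to realize "trust only data-supported actions"; the paper's exact masking rule may differ:

```python
def masked_target(reward, gamma, next_q, next_beta, threshold=0.05):
    """Bellman target that bootstraps only over actions whose behavior
    probability beta exceeds a threshold (illustrative cutoff form).
    """
    supported = [q for q, b in zip(next_q, next_beta) if b >= threshold]
    if not supported:        # no trusted action: fall back to all actions
        supported = next_q
    return reward + gamma * max(supported)
```

Compared with the vanilla target `reward + gamma * max(next_q)`, a spuriously high Q-value on a nearly untried action no longer propagates into the update, which is precisely the stabilization effect described above.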
4. Validation methodology and results
Validation spans a wide range, from simple environments to hard-exploration problems. The baselines are the conventional DQN and more complex exploration-focused methods; the evaluation metrics are cumulative reward, stability of exploration, and speed of convergence. In the experiments, β-DQN outperformed the baselines on many tasks, with especially consistent improvements confirmed in settings where data coverage is thin.
What is notable is how small the computational overhead is: estimating β is close to frequency counting over experience, so it has little effect on training speed or compute load. Consequently, the field pilot phase rarely requires hardware upgrades, which improves ROI (return on investment).
As for correcting overestimation bias, overestimated actions actually receive feedback, which promotes correction of their Q-values and in turn reduces unnecessary exploration cost. Together these effects deliver policy improvement that is both rapid and stable overall.
That said, the method is not overwhelmingly superior on every task: results range from marginal improvements on standard problems to pronounced improvements on hard maze problems. For field deployment, it is realistic to set up a small-scale pilot experiment and evaluate early whether the effect materializes.
5. Discussion and open issues
There are three main points of discussion. First, the effect of β's estimation accuracy on learning outcomes: in states where β cannot be estimated accurately, a risk remains of steering exploration in the wrong direction. Second, bias in the replay memory can distort β, and how to correct for this is an open problem. Third, for real environments that change dynamically, how quickly β can be updated and policies switched is the operational key.
Technically, efficiently estimating β becomes a challenge when the state space is very large. Local frequency counting cannot be expected to generalize sufficiently, so β estimation combined with approximation techniques or clustering needs to be investigated. In addition, designing the meta-controller that selects among policies involves the difficulty of removing task dependence to keep it general.
Ethical and operational issues should not be overlooked. Guardrails and explicit human oversight must be put in place so that β-driven exploration does not violate site operating procedures or safety requirements. Especially for deployment on manufacturing floors or at customer touchpoints, staged rollout and an immediate-stop mechanism for anomalies are essential.
Finally, deeper theoretical analysis bears directly on future trustworthiness. β-DQN's empirical success has been demonstrated, but theoretical guarantees on convergence limits and worst-case behavior remain unestablished, so continued discussion and validation in the research community are needed.
6. Directions for further study
The practical focus going forward converges on three items. First, improving the robustness of β estimation: statistical corrections for data bias and coupling with representation learning should be examined. Second, standardizing small-scale experiments: a protocol for quickly validating the effect before deployment is desirable. Third, strengthening safety and explainability: visualization and operating rules that support consensus-building in the field are needed.
The English keywords that researchers and practitioners should consult are as follows: "beta-DQN", "behavioral policy estimation", "exploration versus exploitation", "overestimation bias correction", "replay buffer coverage". Literature searches on these phrases will surface the discussion surrounding this work.
The shortest learning roadmap is to first set up a small implementation of an existing DQN, log the experience data, and try a simple estimate of β. From there, the recommended practice is to add β-based policies incrementally and confirm whether they have an effect through A/B tests. What matters is staged, visualized evaluation.
Finally, what management should note is that a small modification can reduce practical uncertainty. It is rational to limit risk through staged experiments and let the resulting data decide the next expansion.
Phrases for meetings
"Let's start with an A/B test using β in a limited area, and roll out in stages if it beats expectations."
"β is an indicator of past action frequencies, so getting the site's logs in order comes first."
"By giving overestimated options real feedback, we can correct the model's misjudgments early."
