Course Overview
Deep Reinforcement Learning: Principles, Algorithms, and Applications
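Objectives and Benefits
- Slide-based explanations of the algorithms, combined with code analysis
- In-depth coverage of the design, characteristics, and differences of the various reinforcement learning algorithms
- Examples from real-world applications and analysis of industry trends
- Analysis of demo implementation code for reinforcement learning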
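Target Audience
1. Technical staff interested in the principles and applications of reinforcement learning algorithms, with some programming (Python) and mathematics (linear algebra, probability theory) background.
2. Some familiarity with deep learning models is preferred.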
Course Content
Environment requirements:
- Python 3.5 or later
- GPU: a machine with an NVIDIA GTX 960 or better
Course Outline
1. Introduction to Reinforcement Learning
- Characteristics of reinforcement learning
- Examples of reinforcement learning
- Components of reinforcement learning
  - Rewards
  - Environment
  - History and State
  - Observation
  - Agent: Policy, Value, Model
  - Case study: maze learning
- Categories of reinforcement learning
  - Value Based
  - Policy Based
  - Actor-Critic
  - Model Free vs Model Based
- Sequential decision making in reinforcement learning
  - Learning and Planning
  - Case study: Atari video games
  - Exploration and Exploitation
  - Prediction and Control
2. Markov Decision Processes (MDP)
- Markov Processes
- Markov Reward Processes
- Markov Decision Processes
- Extensions to MDPs
3. Planning by Dynamic Programming
- Policy Evaluation
- Policy Iteration
- Value Iteration
- Extensions to Dynamic Programming
- Contraction Mapping
4. Model-Free Prediction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ) Learning
5. Model-Free Control
- On-Policy Monte-Carlo Control
- On-Policy Temporal-Difference Learning
- Off-Policy Learning
6. Value Function Approximation
- Incremental Methods
- Batch Methods
7. Policy Gradient
- Finite Difference Policy Gradient
- Monte-Carlo Policy Gradient
- Actor-Critic Policy Gradient
- Proximal Policy Optimization (PPO): the default reinforcement learning algorithm at OpenAI
- On-Policy vs. Off-Policy: Importance Sampling
  - Issue of Importance Sampling
  - On-Policy -> Off-Policy
  - Add Constraint
- PPO / TRPO
- Q-Learning
  - Critic
  - Target Network
  - Replay Buffer
  - Tips for Q-Learning
  - Double DQN
  - Dueling DQN
  - Prioritized Replay
  - Noisy Net
  - Distributional Q-function
  - Rainbow
  - Q-Learning for Continuous Actions
- Actor-Critic
  - A3C
  - Advantage Actor-Critic
  - Pathwise Derivative Policy Gradient
- Imitation Learning
  - Behavior Cloning
- Inverse Reinforcement Learning (IRL)
  - Framework of IRL
  - IRL and GAN
- Sparse Reward
  - Curiosity
  - Curriculum Learning
  - Hierarchical Reinforcement Learning
8. Integrating Learning and Planning
- Model-Based Reinforcement Learning
- Integrated Architectures
- Simulation-Based Search
9. Exploration and Exploitation
- Multi-Armed Bandits
- Contextual Bandits
- MDPs
10. Reinforcement Learning in Games
- Overview of game theory
- Minimax Search
- Self-Play Reinforcement Learning
- Combining reinforcement learning and minimax search
- Reinforcement Learning in Imperfect-Information Games