Building a Custom Gym Environment from Scratch: A Practical Quantitative Trading Guide Using the Stock Market as an Example

Reference Value and Background

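OpenAI's gym package is well designed, and it makes it easy to adapt a reinforcement learning agent's setup to your needs. For quantitative trading, a number of articles offer useful advice, and the code that accompanies them is especially helpful for putting together a basic trading setup. For beginners, these ready-made environments are a good way to learn. To solve a specific problem, however, you need to build your own agent, and that in turn requires a purpose-built environment. In this article we use gym's interface as the blueprint for a stock trading environment.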

import gym
import numpy as np
from gym import spaces

# N_DISCRETE_ACTIONS, HEIGHT, WIDTH and N_CHANNELS are placeholders to be
# defined for the concrete problem.
class CustomEnv(gym.Env):
  """Custom environment that follows the gym interface."""
  metadata = {'render.modes': ['human']}

  def __init__(self, arg1, arg2):
    # arg1 and arg2 stand for whatever constructor arguments the environment needs
    super(CustomEnv, self).__init__()
    # Define the action space and the observation space.
    # They must be gym.spaces objects.
    # Example when using discrete actions:
    self.action_space = spaces.Discrete(N_DISCRETE_ACTIONS)
    # Example for using image as input:
    self.observation_space = spaces.Box(low=0, high=255, shape=
                    (HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)

  def step(self, action):
    # Execute one time step within the environment (apply the agent's action)
    ...

  def reset(self):
    # Reset the state of the environment to an initial state
    ...

  def render(self, mode='human', close=False):
    # Render the environment to the screen
    ...

Why Build a Custom Environment

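Traders typically watch price charts closely, and those charts carry a rich set of technical indicators. In reinforcement learning, we want the agent to weigh all of these factors before it acts, so observation_space has to capture them. In stock trading, choosing the type of action is not enough; the agent must also decide how much to buy or sell. Gym's Box space covers both needs. One component of the action vector encodes the action type (buy, sell, or hold), and a second, continuous component encodes the amount, so each trade can range from 0% to 100% of the balance or position.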
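As a rough illustration of how such a two-component action might be read, the sketch below splits an action vector into a type and a fraction. The thresholds mirror the ones used in the _take_action method later in this article (below 1 buys, below 2 sells, anything else holds); the helper name decode_action and the clamping step are assumptions added here, not part of the original code.

```python
def decode_action(action):
    """Split a 2-element action vector into (kind, fraction).

    Assumed convention, matching Box bounds low=[0, 0], high=[3, 1]:
    action[0] in [0, 1) means buy, [1, 2) means sell, [2, 3] means hold;
    action[1] is the fraction of the balance (buy) or position (sell).
    """
    action_type, amount = float(action[0]), float(action[1])
    # Clamp the fraction defensively in case a policy overshoots the bounds.
    amount = min(max(amount, 0.0), 1.0)
    if action_type < 1:
        return "buy", amount
    if action_type < 2:
        return "sell", amount
    return "hold", 0.0


print(decode_action([0.4, 0.25]))  # buy with a quarter of the balance
print(decode_action([1.7, 1.0]))   # sell the whole position
print(decode_action([2.9, 0.5]))   # hold; the amount is ignored
```

Packing the type and the amount into one Box keeps the policy output a single flat vector, at the cost of treating a categorical choice as a continuous number.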

Defining Key Parameters

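When building the environment, defining the action space is critical, because it determines every action the agent can take. In a stock trading environment, this means specifying both the action types and their details, such as the trade size. The observation space, in turn, contains everything the agent should consider before it decides. Including price, volume, and similar data seen in real trading gives the agent a fuller picture of the market and a better basis for its decisions.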

CSV Data
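The environment code in this article reads the columns Open, High, Low, Close and Volume from a pandas DataFrame and indexes rows by integer position. A minimal sketch of loading a CSV with that layout might look as follows; the rows are made up purely for illustration, and the Date column and the REQUIRED_COLUMNS check are assumptions added here.

```python
import io

import pandas as pd

# Made-up rows for illustration only; a real file would hold actual market data.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
2020-01-02,100.0,102.5,99.5,101.0,1200000
2020-01-03,101.0,103.0,100.0,102.5,1500000
2020-01-06,102.5,104.0,101.5,103.0,1100000
"""

REQUIRED_COLUMNS = ["Open", "High", "Low", "Close", "Volume"]

# In practice the argument would be a file path rather than a StringIO.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
# Sort chronologically and rebuild a 0..n-1 index, since the environment
# looks rows up by integer label.
df = df.sort_values("Date").reset_index(drop=True)

missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
if missing:
    raise ValueError(f"CSV is missing columns: {missing}")

print(df[REQUIRED_COLUMNS].head())
```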

Reward Function Design

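class StockTradingEnv(gym.Env):
  """A stock trading environment for OpenAI gym"""
  metadata = {'render.modes': ['human']}

  def __init__(self, df):
    super(StockTradingEnv, self).__init__()
    self.df = df  # pandas DataFrame with Open, High, Low, Close, Volume columns
    # Reward range: 0 up to the maximum account balance.
    # MAX_ACCOUNT_BALANCE and the other upper-case constants are assumed to be
    # defined at module level.
    self.reward_range = (0, MAX_ACCOUNT_BALANCE)
    # Actions of the format Buy x%, Sell x%, Hold, etc.
    # Continuous Box: action[0] selects the action type, action[1] the amount.
    self.action_space = spaces.Box(
      low=np.array([0, 0]), high=np.array([3, 1]), dtype=np.float16)
    # Observation: five market rows (Open, High, Low, Close, Volume) over a
    # six-step window, plus one row of six account values, all scaled to 0-1.
    self.observation_space = spaces.Box(
      low=0, high=1, shape=(6, 6), dtype=np.float16)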

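The reward function is equally important: the agent should care about profit accumulated over time. When designing it, it is sensible to discount the agent's rewards in the early steps, which gives the agent enough time to explore before it optimizes a single strategy too deeply. This keeps it from settling into a local optimum too early and gives it more chances to try different strategies, which improves the odds of eventually finding the global optimum.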
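One simple way to express this idea, and the one the step method later in this article uses, is to multiply the account balance by the fraction of steps elapsed. The sketch below isolates that calculation; the function name delayed_reward and the value chosen for MAX_STEPS are assumptions for illustration only.

```python
MAX_STEPS = 20000  # assumed value for illustration; not taken from the article


def delayed_reward(balance, current_step, max_steps=MAX_STEPS):
    """Scale the balance by how far the run has progressed."""
    delay_modifier = current_step / max_steps
    return balance * delay_modifier


# The same balance earns a larger reward the later it is observed.
for step in (100, 10000, 20000):
    print(step, delayed_reward(10000.0, step))
```

Because the modifier grows with the step count, a balance held early contributes little, which leaves room for exploration before the reward signal becomes strong.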

def reset(self):
  # Reset the state of the environment to an initial state
  self.balance = INITIAL_ACCOUNT_BALANCE  # starting cash
  self.net_worth = INITIAL_ACCOUNT_BALANCE  # net worth = value of shares held + cash balance
  self.max_net_worth = INITIAL_ACCOUNT_BALANCE  # highest net worth reached so far; updated as trading proceeds
  self.shares_held = 0  # number of shares currently held
  self.cost_basis = 0  # average cost per share held
  self.total_shares_sold = 0  # cumulative number of shares sold
  self.total_sales_value = 0  # cumulative sale proceeds

  # Set the current step to a random point within the data frame
  # (needs `import random`); the upper bound leaves room for the six-row window.
  self.current_step = random.randint(0, len(self.df.loc[:, 'Open'].values) - 6)
  return self._next_observation()

Applying and Implementing the Environment

def _next_observation(self):
  # Get the data points for the window starting at current_step and scale
  # them to between 0-1. Note that .loc slices by label and includes both
  # endpoints, so each row below holds six values.
  frame = np.array([
    self.df.loc[self.current_step: self.current_step +
                5, 'Open'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'High'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Low'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Close'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Volume'].values / MAX_NUM_SHARES,
  ])
  # Append additional (account) data and scale each value to between 0-1
  obs = np.append(frame, [[
    self.balance / MAX_ACCOUNT_BALANCE,
    self.max_net_worth / MAX_ACCOUNT_BALANCE,
    self.shares_held / MAX_NUM_SHARES,
    self.cost_basis / MAX_SHARE_PRICE,
    self.total_shares_sold / MAX_NUM_SHARES,
    self.total_sales_value / (MAX_NUM_SHARES * MAX_SHARE_PRICE),
  ]], axis=0)
  return obs

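We now implement the environment as planned. First, the constructor defines action_space and observation_space, and it must accept a pandas DataFrame as its argument. Next, the _next_observation method combines a window of market data, starting at the randomly chosen current step, with the agent's account data and scales everything to the 0-1 range. The environment then advances through the step function, which is called for every move the agent makes: it applies the chosen action, computes the immediate reward, and returns the next observation to the agent.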
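One detail of the observation window is easy to miss: pandas .loc slices by label and keeps both endpoints, whereas position-based .iloc slicing drops the end. The small check below, using made-up prices on a default integer index, shows why a slice from current_step to current_step + 5 yields six values and why the code bounds the start index at the length minus 6.

```python
import pandas as pd

# Ten made-up opening prices on the default 0..9 integer index.
df = pd.DataFrame({"Open": [10.0 + i for i in range(10)]})

current_step = 2
by_label = df.loc[current_step: current_step + 5, "Open"].values
by_position = df["Open"].iloc[current_step: current_step + 5].values

print(len(by_label))     # label-based slice keeps both endpoints
print(len(by_position))  # position-based slice excludes the end

# The last start index that still leaves a full label-based window:
last_valid_start = len(df) - 6
print(len(df.loc[last_valid_start: last_valid_start + 5, "Open"]))
```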

Testing in Practice and Outlook

def step(self, action):
  # Execute one time step within the environment
  self._take_action(action)
  self.current_step += 1  # advance to the next row
  # Wrap around to the start once the window would run past the data
  if self.current_step > len(self.df.loc[:, 'Open'].values) - 6:
    self.current_step = 0
  # Delay modifier: scales the reward by progress, so early rewards count for less
  delay_modifier = (self.current_step / MAX_STEPS)

  reward = self.balance * delay_modifier  # loosely comparable to a discount factor in RL
  done = self.net_worth <= 0  # end the episode once net worth is zero or below
  obs = self._next_observation()  # next observation
  return obs, reward, done, {}

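Next, we create an instance of the StockTradingEnv class and pick an algorithm from stable-baselines to try on it. What we have built so far, however, is only a first version of a custom gym environment for automated stock trading. Applying reinforcement learning to real markets requires continually refining and extending this environment, which takes both financial knowledge and engineering skill, along with repeated testing and correction.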
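Before handing an environment to a learning algorithm, a random-action rollout can serve as a cheap sanity check of the reset/step contract. The sketch below is not the article's training code: ToyEnv is a hypothetical stand-in with the same reset and step signatures, and random_rollout is an illustrative helper. With the real environment, actions would more naturally come from env.action_space.sample().

```python
import random


class ToyEnv:
    """Hypothetical stand-in exposing the same reset/step contract."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]  # placeholder observation

    def step(self, action):
        self.t += 1
        reward = float(action[1])  # placeholder reward
        done = self.t >= self.episode_length
        return [float(self.t)], reward, done, {}


def random_rollout(env, max_steps=100, seed=0):
    """Drive env with uniformly random [action_type, amount] actions."""
    rng = random.Random(seed)
    env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = [rng.uniform(0, 3), rng.uniform(0, 1)]
        _, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return steps, total_reward


steps, total = random_rollout(ToyEnv())
print(f"ran {steps} steps, total reward {total:.3f}")
```

The max_steps cap matters for the stock environment in particular, because its step method only signals done when net worth falls to zero or below.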

def _take_action(self, action):
  # Set the current price to a random price within the time step
  # (drawn uniformly between the row's open and close)
  current_price = random.uniform(
    self.df.loc[self.current_step, "Open"],
    self.df.loc[self.current_step, "Close"])
  # action = [action_type, amount]; amount is a fraction between 0 and 1
  action_type = action[0]
  amount = action[1]
  if action_type < 1:
    # Buy amount % of balance in shares
    total_possible = self.balance / current_price  # most shares the balance can buy
    shares_bought = total_possible * amount  # shares actually bought
    prev_cost = self.cost_basis * self.shares_held  # total cost of the existing position
    additional_cost = shares_bought * current_price  # cash needed for this purchase
    self.balance -= additional_cost  # deduct from the cash balance
    total_shares = self.shares_held + shares_bought
    if total_shares > 0:  # avoid dividing by zero when nothing is bought or held
      # New average cost per share across old and new shares
      self.cost_basis = (prev_cost + additional_cost) / total_shares
    self.shares_held += shares_bought  # position grows
  elif action_type < 2:
    # Sell amount % of shares held
    shares_sold = self.shares_held * amount
    self.balance += shares_sold * current_price  # cash balance grows
    self.shares_held -= shares_sold  # position shrinks
    self.total_shares_sold += shares_sold  # cumulative shares sold
    self.total_sales_value += shares_sold * current_price  # cumulative sale proceeds
  # Net worth = cash balance + market value of the shares held
  self.net_worth = self.balance + self.shares_held * current_price
  # Track the highest net worth reached
  if self.net_worth > self.max_net_worth:
    self.max_net_worth = self.net_worth
  # With no position, the per-share cost basis is 0
  if self.shares_held == 0:
    self.cost_basis = 0

def render(self, mode='human', close=False):
  # Render the environment to the screen.
  # `mode` selects the output format; only 'human' (console output) is handled here.
  profit = self.net_worth - INITIAL_ACCOUNT_BALANCE  # profit = current net worth - initial capital
  print(f'Step: {self.current_step}')  # current step
  print(f'Balance: {self.balance}')  # cash balance
  # Current position and cumulative shares sold
  print(f'Shares held: {self.shares_held} '
        f'(Total sold: {self.total_shares_sold})')
  # Average cost of the position and cumulative sale proceeds
  print(f'Avg cost for held shares: {self.cost_basis} '
        f'(Total sales value: {self.total_sales_value})')
  # Current net worth and the highest net worth so far
  print(f'Net worth: {self.net_worth} '
        f'(Max net worth: {self.max_net_worth})')
  print(f'Profit: {profit}')  # profit