Why Economic Policies are Wrong and How to Fix Them – Why Words Alone May Not Be Enough

When explaining complex systems and methodologies, words alone can invite skepticism: language tends to introduce vagueness or hide technical complexity. To address this, I have implemented a basic version of the system described in the previous article (https://www.financecs.com/2024/09/21/why-economic-policies-are-wrong-and-how-to-fix-them/). The system integrates three key components:

  • Data & Visualization
  • Principal Component Analysis (PCA) & Productivity Normalization
  • Reinforcement Learning for optimal group policy

This article assumes some technical background, but I will aim to simplify the explanation wherever possible and illustrate how the system works through practical examples.

Generating Sample Data: Building a Synthetic Population

The first step in building the system is to generate a synthetic dataset that mimics a small population, in this case, 20 individuals with various personal attributes. Each individual is characterized by 20 unique features that relate to their physical, cognitive, and social behaviors. This data helps simulate real-world scenarios where each person’s attributes are influenced by others and determine their contribution to group productivity.

Main Features:

  1. IQ: Serves as a central factor in determining several other attributes. Higher IQ indicates better problem-solving abilities and innovation potential.
  2. Age: Influences how quickly individuals learn new skills (younger individuals generally learn faster).
  3. Physical Activity & Health Status: Physical activity directly impacts health, which then affects happiness and productivity.
  4. Sampled Knowledge and Potential: These are generated based on the IQ and physical activity of individuals. For example, those with higher IQ and more physical activity have higher potential and knowledge levels.
  5. Innovation Capabilities: Innovation is tied to IQ and physical activity, where more active and cognitively capable individuals are likely to be more innovative.

Other attributes like smoker status, income, societal coherence, and happiness are derived based on a combination of the main features. For example:

  • Smoker Status: More likely in individuals with lower health scores.
  • Income: Higher IQ and better health lead to higher income.
  • Happiness: Primarily influenced by an individual’s health and physical activity.

Why This Data Matters:
These attributes form the backbone of each individual’s behavior in the simulation. By generating data with specific interdependencies (e.g., IQ affecting knowledge and innovation), we can create a realistic model that captures individual differences and their potential impacts on group productivity. This serves as the starting point for applying advanced techniques like Principal Component Analysis (PCA) and reinforcement learning, which help refine the system’s policy optimization.

import numpy as np
import pandas as pd

SEED = 42
# Define the number of individuals (e.g., 20 people in a small village) and features (20 features)
NUMBER_OF_INDIVIDUALS = 20
NUMBER_OF_FEATURES = 20


def generate_sample_data(seed: int = SEED, n_individuals: int = NUMBER_OF_INDIVIDUALS, n_features: int = NUMBER_OF_FEATURES):
    # Setting a random seed for reproducibility
    np.random.seed(seed)

    # Generate dummy data for 20 individuals with 20 features
    # Features: Sampled knowledge, Sampled potential, Specific Topic Knowledge, Learning capacity, etc.

    # Generate each feature based on a specific range or probability distribution
    # Step 1: Generate IQ, Age, Physical Activity, and Health Status as main factors
    IQ = np.random.normal(100, 15, n_individuals)  # IQ scores (mean 100, SD 15)
    Age = np.random.randint(18, 65, n_individuals)  # Age of individuals
    Physical_activity = np.random.uniform(0, 1, n_individuals)  # Physical activity score (0-1)
    Health_Status = np.clip(Physical_activity * 0.7 + np.random.uniform(0, 0.3, n_individuals), 0, 1)  # Health influenced by activity

    # Step 2: Generate other features influenced by IQ, Age, Physical Activity, and Health Status
    Sampled_knowledge = np.clip(IQ / 130, 0, 1)  # Normalize IQ to be within the range [0, 1] to reflect knowledge
    Sampled_potential = np.clip(IQ / 150 + Physical_activity * 0.5, 0, 1)  # Higher IQ and physical activity indicate higher potential
    Specific_topic_knowledge = np.clip((IQ - 100) / 50 + np.random.uniform(0, 1, n_individuals), 0, 4).astype(int)  # Related to IQ, skewed towards higher IQ
    Learning_capacity_velocity = np.clip(0.6 - (Age / 100) + np.random.normal(0.05, 0.1, n_individuals), 0, 1)  # Learning decreases with age
    Innovation_capabilities = np.clip(IQ / 120 + Physical_activity * 0.3, 0, 1)  # Higher IQ and activity influence innovation
    Lateral_resolution_abilities = np.clip(IQ / 130 + np.random.normal(0.1, 0.1, n_individuals), 0, 1)  # IQ influences problem-solving abilities

    # Step 3: Generate social-related features based on individual factors
    Smoker = (Health_Status < 0.4).astype(int)  # Lower health indicates higher likelihood of smoking
    Sugary_food_intake = np.clip(10 - Health_Status * 10 + np.random.normal(0, 2, n_individuals), 0, 10).astype(int)  # Lower health -> higher sugary food intake
    Societal_coherence = np.clip((100 - Age) / 100 + Health_Status * 0.5, 0, 1)  # Coherence with societal needs influenced by age and health
    Social_media_interaction = np.clip(1 - Age / 70 + np.random.uniform(0, 0.2, n_individuals), 0, 1)  # Younger people more active on social media
    Demographics = np.random.randint(0, 5, n_individuals)  # Categorical demographic variable
    DNA_mapping = np.random.uniform(0, 1, n_individuals)  # Hypothetical, unaffected by main factors
    Happiness = np.clip(Health_Status * 0.6 + Physical_activity * 0.4, 0, 1)  # Happiness influenced by health and activity
    Education_Level = (IQ > 100).astype(int) + (IQ > 120).astype(int)  # Education related to IQ
    Income = (50000 + IQ * 500 + (Health_Status - 0.5) * 10000).astype(int)  # Higher IQ and health lead to higher income
    Job_satisfaction = np.clip(0.5 + (Income / 100000) + Health_Status * 0.3, 0, 1)  # Job satisfaction tied to income and health

    # Step 4: Combine all the data into a DataFrame
    data = {
        'Individual_ID': np.arange(1, n_individuals + 1),
        'Sampled_knowledge': Sampled_knowledge,
        'Sampled_potential': Sampled_potential,
        'Specific_topic_knowledge': Specific_topic_knowledge,
        'Learning_capacity_velocity': Learning_capacity_velocity,
        'Innovation_capabilities': Innovation_capabilities,
        'Lateral_resolution_abilities': Lateral_resolution_abilities,
        'Health_Status': Health_Status,
        'Smoker': Smoker,
        'Sugary_food_intake': Sugary_food_intake,
        'IQ': IQ,
        'Age': Age,
        'Societal_coherence': Societal_coherence,
        'Social_media_interaction': Social_media_interaction,
        'Demographics': Demographics,
        'DNA_mapping': DNA_mapping,
        'Happiness': Happiness,
        'Education_Level': Education_Level,
        'Income': Income,
        'Physical_activity': Physical_activity,
        'Job_satisfaction': Job_satisfaction,
    }

    # Create the DataFrame
    df = pd.DataFrame(data)
    return df
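
Before moving on, it can be useful to sanity-check the generated population. The following is a minimal usage sketch (it only calls the function defined above and prints a few summaries; the column names follow the DataFrame built in generate_sample_data):

# Usage sketch: build the synthetic population and inspect it
df = generate_sample_data()

print(df.shape)   # (20, 21): Individual_ID plus 20 features for 20 individuals
print(df.head())  # first few individuals
print(df[['IQ', 'Age', 'Health_Status', 'Income', 'Happiness']].describe())  # summary statistics for a few key columns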

Visualizing Sample Data: Applying PCA and Classifying Productivity

Once we have generated the synthetic data, the next step is to visualize and analyze it using Principal Component Analysis (PCA). PCA helps reduce the dimensionality of the dataset (which has 20 features for each individual) while preserving the most important patterns and relationships. In this case, we reduce the dataset to 3 principal components for easy 3D visualization.

Step-by-Step Breakdown:

  1. PCA Transformation:
    Using PCA, we reduce the 20-dimensional dataset into 3 dimensions. These new axes, called principal components, represent the directions of maximum variance in the dataset, meaning they capture the most important relationships between individuals’ attributes.
  2. 3D Visualization:
    The three principal components are then plotted in a 3D space. Each individual is represented as a point in this space, where similar individuals (in terms of their knowledge, innovation, health, etc.) are closer to one another.
  3. Classifying Productivity Groups:
    To better understand the distribution of productivity, we classify individuals into three productivity groups:
  • Most Productive
  • Average Productive
  • Least Productive
    This classification is based on a composite productivity score, which is the average of normalized features such as Sampled Knowledge, Sampled Potential, Innovation Capabilities, and Learning Capacity. Individuals are grouped based on thresholds derived from the productivity score.
  4. Color-Coding the Groups:
    We assign different colors to each group:
  • Green for the most productive,
  • Yellow for the average productive, and
  • Red for the least productive.
    These color-coded individuals are then plotted in the 3D PCA space, allowing us to visually identify the distribution of productivity across the population.

Why PCA and Classification Matter:
PCA helps in simplifying the complexity of the dataset, making it easier to identify patterns and relationships that are otherwise hidden in the high-dimensional data. The classification of individuals into productivity groups allows us to evaluate their contributions to overall productivity and helps in formulating strategies to improve the performance of the least productive groups. This is crucial for the next step, where reinforcement learning is applied to find the optimal policies for each group.

The visualization also provides insights into how individuals differ in terms of their productivity, innovation, and learning abilities, and serves as a foundation for forecasting and policy learning.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from generate_sample_data import generate_sample_data
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
from matplotlib.lines import Line2D

def visualize_PCA_3D(df):
    # We will use PCA (Principal Component Analysis) to reduce dimensionality to 3D for visualization
    pca = PCA(n_components=3)
    pca_result = pca.fit_transform(df.iloc[:, 1:])  # Skip the Individual_ID for PCA

    # Extract the three principal components
    pca_1 = pca_result[:, 0]
    pca_2 = pca_result[:, 1]
    pca_3 = pca_result[:, 2]

    # 3D Visualization
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')

    # Plot individuals in 3D space based on PCA-reduced data
    ax.scatter(pca_1, pca_2, pca_3, c='b', marker='o')

    # Labeling the axes
    ax.set_xlabel('Principal Component 1')
    ax.set_ylabel('Principal Component 2')
    ax.set_zlabel('Principal Component 3')

    # Title of the plot
    ax.set_title('3D Visualization of Individuals Based on Principal Components')

    plt.show()
    return pca_1, pca_2, pca_3



def visualize_PCA_3D_with_productivity(df, pca_1, pca_2, pca_3):
    # Normalize the selected features using MinMaxScaler (values between 0 and 1)
    scaler = MinMaxScaler()
    features_to_normalize = ['Sampled_knowledge', 'Sampled_potential', 'Innovation_capabilities', 'Learning_capacity_velocity']
    df[features_to_normalize] = scaler.fit_transform(df[features_to_normalize])

    # Recalculate the productivity score using normalized values
    df['Productivity_score'] = (df['Sampled_knowledge'] + df['Sampled_potential'] + 
                                df['Innovation_capabilities'] + df['Learning_capacity_velocity']) / 4

    # Reclassify individuals into three groups: Most Productive, Average Productive, Least Productive
    threshold_high = df['Productivity_score'].quantile(0.67)
    threshold_low = df['Productivity_score'].quantile(0.33)

    df['Productivity_group'] = pd.cut(df['Productivity_score'], 
                                    bins=[-np.inf, threshold_low, threshold_high, np.inf], 
                                    labels=['Least Productive', 'Average Productive', 'Most Productive'])

    # Now visualize the classified groups in the 3D PCA plot again
    group_colors = {'Most Productive': 'g', 'Average Productive': 'y', 'Least Productive': 'r'}
    colors = df['Productivity_group'].map(group_colors)

    # 3D Visualization
    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')

    # Plot individuals in 3D space based on PCA-reduced data, colored by productivity group
    sc = ax.scatter(pca_1, pca_2, pca_3, c=colors, marker='o')

    # Labeling the axes
    ax.set_xlabel('Principal Component 1')
    ax.set_ylabel('Principal Component 2')
    ax.set_zlabel('Principal Component 3')

    # Add a legend for productivity groups
    legend_elements = [
        Line2D([0], [0], marker='o', color='w', label='Most Productive', markerfacecolor='g', markersize=10),
        Line2D([0], [0], marker='o', color='w', label='Average Productive', markerfacecolor='y', markersize=10),
        Line2D([0], [0], marker='o', color='w', label='Least Productive', markerfacecolor='r', markersize=10),
    ]
    ax.legend(handles=legend_elements, loc='best')

    # Title of the plot
    ax.set_title('3D Visualization of Individuals Classified by Productivity (Normalized)')

    plt.show()
    return df


def visualize_productivity_forecast(df_new):
    # Prepare the data for forecasting based on productivity scores

    # We'll use 'Productivity_score' from the previous classification (already normalized and influenced by key factors)

    # Group individuals by their productivity group
    least_productive = df_new[df_new['Productivity_group'] == 'Least Productive']
    average_productive = df_new[df_new['Productivity_group'] == 'Average Productive']
    most_productive = df_new[df_new['Productivity_group'] == 'Most Productive']

    # Calculate the current mean productivity score for each group
    mean_least_productive = least_productive['Productivity_score'].mean()
    mean_average_productive = average_productive['Productivity_score'].mean()
    mean_most_productive = most_productive['Productivity_score'].mean()

    # Create a small time horizon for forecasting (e.g., 10 time periods)
    time_periods = np.arange(10).reshape(-1, 1)

    # Define the expected improvement rates for each group
    # Least productive will improve faster, average moderately, most productive slower but steady
    growth_rate_least = 0.05
    growth_rate_average = 0.03
    growth_rate_most = 0.01

    # Forecast productivity over 10 periods
    forecast_least = mean_least_productive + growth_rate_least * time_periods
    forecast_average = mean_average_productive + growth_rate_average * time_periods
    forecast_most = mean_most_productive + growth_rate_most * time_periods
    # Visualize the forecasted productivity for each group
    plt.figure(figsize=(10, 6))
    plt.plot(time_periods, forecast_least, label='Least Productive', color='r')
    plt.plot(time_periods, forecast_average, label='Average Productive', color='y')
    plt.plot(time_periods, forecast_most, label='Most Productive', color='g')
    plt.xlabel('Time Period')
    plt.ylabel('Productivity Score')
    plt.title('Forecasted Productivity Over Time (By Group)')
    plt.legend()
    plt.grid(True)
    plt.show()

df = generate_sample_data()
pca_1, pca_2, pca_3 = visualize_PCA_3D(df)
df_new = visualize_PCA_3D_with_productivity(df, pca_1, pca_2, pca_3)
visualize_productivity_forecast(df_new)

In this context, PCA components represent the major patterns or relationships in the data that account for the most variation between individuals. Each principal component is a linear combination of the original features (e.g., knowledge, health, innovation, etc.). Here’s a breakdown of how to interpret these PCA components:

PCA Component 1: Overall Knowledge and Potential

  • What it captures: This component likely captures the overall knowledge and potential of individuals. Since features like IQ, sampled knowledge, and sampled potential are some of the strongest contributors, this component may measure an individual’s capacity for productivity and learning.
  • Interpretation: Individuals with a high score in this component are likely more educated, have greater potential, and are capable of higher productivity. This dimension could differentiate individuals who have a strong foundation of knowledge from those with lower potential.

PCA Component 2: Health and Physical Well-being

  • What it captures: This component is likely dominated by features related to health status, physical activity, and their associated effects on productivity. Health plays a critical role in determining an individual’s well-being and capability to perform.
  • Interpretation: Individuals scoring high on this component are healthier and more physically active, contributing positively to their productivity. Conversely, lower scores might indicate poorer health, possibly associated with behaviors like smoking or higher sugary food intake.

PCA Component 3: Innovation and Learning Velocity

  • What it captures: This component likely focuses on innovation capabilities and learning capacity velocity. It measures how quickly an individual can adapt and innovate based on their current state of knowledge and abilities.
  • Interpretation: A higher score on this component means the individual has a strong capacity to innovate and learn rapidly. Lower scores suggest individuals may struggle with adapting to new challenges or coming up with creative solutions.

Overall Role of PCA Components:

  • Component 1: Likely reflects cognitive capabilities and knowledge (e.g., potential for productivity).
  • Component 2: Reflects physical well-being and health, important for maintaining consistent productivity.
  • Component 3: Measures the ability to innovate and adapt, important for future growth and development.

These three components, combined, help summarize the entire dataset into a simpler form, giving us a better understanding of what drives productivity across individuals.
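
These interpretations can be checked directly against the fitted PCA. Below is a minimal sketch that refits PCA on the same data and seed (since visualize_PCA_3D does not return the fitted pca object) and prints the explained variance plus the strongest loadings per component. Note that the features are not standardized here, so large-scale columns such as Income will weigh heavily in the loadings; scaling the features before PCA would give a more balanced picture.

# Sketch: how much variance do the three components keep, and which features drive them?
import pandas as pd
from sklearn.decomposition import PCA
from generate_sample_data import generate_sample_data

features = generate_sample_data().drop(columns=['Individual_ID'])
pca = PCA(n_components=3).fit(features)

print("Explained variance ratio:", pca.explained_variance_ratio_)

loadings = pd.DataFrame(pca.components_.T, index=features.columns,
                        columns=['PC1', 'PC2', 'PC3'])
for pc in loadings.columns:
    # five strongest contributors (by absolute loading) for each component
    print(pc, loadings[pc].abs().sort_values(ascending=False).head(5).index.tolist())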

Reinforcement Learning: Optimal Policy Determination

Reinforcement Learning (RL) is the key to developing an adaptive system that can learn and recommend optimal policies for individuals or groups. In the context of our productivity model, RL helps us discover the best strategies for improving productivity across knowledge, health, and innovation by adjusting actions such as educational investments, health initiatives, and innovation incentives.

The Role of Reinforcement Learning in this Model

RL works by interacting with the environment (our productivity simulation) and learning from feedback (rewards). The goal is to maximize long-term rewards by making smart decisions about how to distribute resources and taxes across three groups: the least, average, and most productive.

Here’s how the RL process works in this model:

  1. States: Represent the current productivity levels of each group, defined by four main dimensions: knowledge, health, innovation, and potential private consumption.
  2. Actions: The actions consist of adjustments to education, health, and innovation investments. These actions determine how resources are allocated to each group and can either increase or decrease those dimensions.
  3. Rewards: The reward is based on the improvement of productivity while ensuring balanced growth across all dimensions. A penalty is applied for imbalances between knowledge, health, and innovation, as well as for extreme swings in actions.
  4. Policy Learning: The RL agent learns an optimal policy by adjusting actions over time, seeking to maximize productivity in a balanced way across the three dimensions.

Key Elements of the Reinforcement Learning Approach:

  1. State Space: The RL agent observes the current state of the productivity environment. In our case, each group (least, average, and most productive) has its own values for knowledge, health, innovation, and potential. The state space is the combined values across all groups and these four dimensions.
  2. Action Space: The agent chooses actions from a continuous action space that adjusts education, health, and innovation investments. The actions are constrained to sum to zero (ensuring balanced decisions), and extreme changes are penalized, so the agent must decide how much to shift toward each dimension (a short numerical sketch of this constraint follows this list).
  3. Reward Function:
  • Positive rewards are given for improvements in knowledge, health, and innovation.
  • Balanced growth: If every dimension improves (none declines), the agent receives an additional bonus proportional to the average improvement.
  • Penalties: If there’s an imbalance in growth (e.g., knowledge grows too fast while health declines), or if actions are extreme (e.g., too large of an investment in one dimension), penalties are applied.
  • Potential Consumption: Additionally, potential private consumption increases if the agent efficiently balances lower public investments (like taxes) while improving productivity.
  4. Policy Optimization: Over hundreds of episodes (or iterations), the agent improves its decision-making by exploring different actions and learning from their outcomes. The goal is to find a policy that maximizes long-term productivity for all groups while maintaining balance across all dimensions.
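
To make the zero-sum action constraint from point 2 concrete, here is a minimal numerical sketch of the same mean-centering and scaling trick the Actor network applies in the code further down (shown here with numpy instead of torch):

# Sketch of the zero-sum action constraint used by the Actor network
import numpy as np

raw = np.array([0.9, -0.2, 0.4])                       # raw, unbounded actions: education, health, innovation
centered = raw - raw.mean()                            # subtract the mean so the three actions sum to zero
scaled = centered / max(np.abs(centered).max(), 1.0)   # rescale into [-1, 1]; dividing by a scalar keeps the sum at zero

print(scaled, scaled.sum())                            # actions stay in [-1, 1] and sum to (numerically) zero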

Training the Agent

Through the training process, the RL agent interacts with the environment, receiving rewards and adjusting its policy to improve productivity. The agent explores different actions and learns the most effective way to allocate resources by taking into account both short-term gains and long-term growth.

For example:

  • The agent might learn that increasing education for the least productive group can have significant long-term benefits, but doing so at the expense of health might lead to burnout, reducing potential.
  • Similarly, too much innovation without sufficient investment in knowledge might lead to rapid improvements in creativity but a lack of foundation in core skills.

By balancing these trade-offs, the RL agent discovers the optimal allocation of resources that drives productivity growth for the least, average, and most productive groups.

import gym
from gym import spaces
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque
import matplotlib.pyplot as plt

# Redefine the Productivity Environment with Continuous Actions
class ProductivityEnvContinuous(gym.Env):
    def __init__(self):
        super(ProductivityEnvContinuous, self).__init__()

        # Define continuous action space: each action (education, health, innovation) is between [-1, 1]
        self.action_space = spaces.Box(low=-1, high=1, shape=(3,), dtype=np.float32)

        # Observation space: 3 groups (least, average, most productive), with underlying states of [knowledge, health, innovation, potential]
        self.observation_space = spaces.Box(low=0, high=1, shape=(3, 4), dtype=np.float32)

        # Initialize the state for each group
        self.state = np.array([[0.4, 0.6, 0.5, 0.4],  # Least productive
                               [0.6, 0.7, 0.6, 0.6],  # Average productive
                               [0.8, 0.7, 0.7, 0.7]])  # Most productive

        # Track the number of steps
        self.steps = 0

    def step(self, action):
        # action = [education, health, innovation], continuous between [-1, 1]
        education, health, innovation = action
        previous_state = self.state.copy()

        # Apply changes based on the taxation mechanism
        for i in range(3):  # Loop over the least, average, and most productive groups
            # Apply investments to each dimension; growth slows as a dimension approaches its upper bound
            self.state[i, 0] += 0.5 * education * (3 - self.state[i, 0])   # Knowledge (productivity)
            self.state[i, 1] += 0.5 * health * (3 - self.state[i, 1])      # Health
            self.state[i, 2] += 0.5 * innovation * (3 - self.state[i, 2])  # Innovation
            # Potential private consumption, which individuals can increase when public spending is lower
            self.state[i, 3] += 0.5 * (education + innovation + health) * (3 - self.state[i, 3])

        # Cap values between 0 and 5
        self.state = np.clip(self.state, 0, 5)

        # Calculate imbalance penalty to discourage uneven growth
        imbalance_penalty = 0
        for i in range(3):
            mean_value = np.mean(self.state[i, :3])  # Mean of knowledge, health, and innovation
            imbalance_penalty += 0.1 * np.sum(np.abs(self.state[i, :3] - mean_value))  # Smaller penalty for imbalance

        # Encourage balanced growth by checking improvement across all dimensions
        improvement = np.sum(self.state - previous_state)
        
        # Reward for balanced growth if all dimensions grow equally
        balanced_growth_bonus = 0
        if np.all(self.state - previous_state > 0):  # Ensure all dimensions grew
            balanced_growth_bonus = 0.05 * np.mean(self.state - previous_state)  # Reward for equal improvement

        # Final reward: improvement - imbalance penalty + balanced growth bonus
        reward = improvement + balanced_growth_bonus - imbalance_penalty - 0.05 * np.sum(np.square(action))  # Slight penalty for extreme actions

        # Step termination condition (e.g., after 50 steps)
        self.steps += 1
        done = self.steps >= 50

        return self.state, reward, done, {}

    def reset(self):
        # Reset the state and steps
        self.state = np.array([[0.4, 0.6, 0.5, 0.4],  
                               [0.6, 0.7, 0.6, 0.6],  
                               [0.8, 0.7, 0.7, 0.7]])  
        self.steps = 0
        return self.state


class ActorNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(ActorNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        
        # Generate raw (unbounded) actions
        actions = self.fc3(x)
        
        # Subtract the mean to ensure the sum of actions is zero
        actions = actions - actions.mean(dim=1, keepdim=True)
        
        # Scale the actions to ensure they are within [-1, 1] and still sum to 0
        max_abs_action = actions.abs().max(dim=1, keepdim=True)[0]  # Find the max absolute value for scaling
        scaled_actions = actions / max_abs_action.clamp(min=1.0)    # Normalize and avoid division by zero

        return scaled_actions


# Define the Critic network (to evaluate actions)
class CriticNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(CriticNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size + action_size, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 1)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Replay buffer to store experiences
class ReplayBuffer:
    def __init__(self, buffer_size, batch_size):
        self.buffer = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self):
        experiences = random.sample(self.buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*experiences)
        return np.array(states), np.array(actions), rewards, np.array(next_states), dones

    def __len__(self):
        return len(self.buffer)

# Define the DDPG agent
class DDPGAgent:
    def __init__(self, state_size, action_size, lr_actor=0.001, lr_critic=0.001, gamma=0.99, tau=0.005):
        self.state_size = state_size
        self.action_size = action_size
        self.actor_local = ActorNetwork(state_size, action_size)
        self.actor_target = ActorNetwork(state_size, action_size)
        self.critic_local = CriticNetwork(state_size, action_size)
        self.critic_target = CriticNetwork(state_size, action_size)
        self.optimizer_actor = optim.Adam(self.actor_local.parameters(), lr=lr_actor)
        self.optimizer_critic = optim.Adam(self.critic_local.parameters(), lr=lr_critic)
        self.memory = ReplayBuffer(10000, 64)
        self.gamma = gamma
        self.tau = tau  # Soft update parameter

    def act(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            action = self.actor_local(state).cpu().data.numpy().flatten()
        return action

    def step(self, state, action, reward, next_state, done):
        self.memory.add((state, action, reward, next_state, done))
        if len(self.memory) > self.memory.batch_size:
            self.learn()

    def learn(self):
        states, actions, rewards, next_states, dones = self.memory.sample()

        # Convert to tensors
        states = torch.FloatTensor(states)
        actions = torch.FloatTensor(actions)
        rewards = torch.FloatTensor(rewards).unsqueeze(1)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones).unsqueeze(1)

        # Get actions from actor target network
        next_actions = self.actor_target(next_states)

        # Get Q values from critic target network
        Q_targets_next = self.critic_target(next_states, next_actions)

        # Compute Q targets for current states
        Q_targets = rewards + (self.gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from local critic network
        Q_expected = self.critic_local(states, actions)

        # Compute critic loss
        critic_loss = nn.MSELoss()(Q_expected, Q_targets)

        # Minimize the loss
        self.optimizer_critic.zero_grad()
        critic_loss.backward()
        self.optimizer_critic.step()

        # Compute actor loss (maximize Q values for the actions chosen by the actor)
        actions_pred = self.actor_local(states)
        actor_loss = -self.critic_local(states, actions_pred).mean()

        # Minimize actor loss
        self.optimizer_actor.zero_grad()
        actor_loss.backward()
        self.optimizer_actor.step()

        # Soft update of target networks
        self.soft_update(self.actor_local, self.actor_target)
        self.soft_update(self.critic_local, self.critic_target)

    def soft_update(self, local_model, target_model):
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(self.tau * local_param.data + (1.0 - self.tau) * target_param.data)

# Set up the environment and agent
env = ProductivityEnvContinuous()  # Custom environment with continuous actions
state_size = env.observation_space.shape[1] * env.observation_space.shape[0]  # 3 groups, 4 state variables each
action_size = env.action_space.shape[0]  # Continuous action size

agent = DDPGAgent(state_size, action_size)

# Track reward evolution over episodes
reward_history = []

# Train the agent
num_episodes = 500
max_steps = 50

# Create separate lists to track knowledge, health, innovation, and potential
knowledge_history = []
health_history = []
innovation_history = []
potential_history = []
composite_productivity_history = []

for episode in range(num_episodes):
    state = env.reset().flatten()  # Flatten the state into a 1D array
    total_reward = 0
    knowledge_sum, health_sum, innovation_sum, potential_sum, composite_productivity_sum = 0, 0, 0, 0, 0

    for step in range(max_steps):
        action = agent.act(state)  # Select action
        next_state, reward, done, _ = env.step(action)  # Continuous action applied
        next_state = next_state.flatten()

        # Store experience and learn
        agent.step(state, action, reward, next_state, done)

        state = next_state
        total_reward += reward

        # Sum the dimensions over the steps in an episode
        knowledge_sum += np.mean(state[::4])     # Mean of knowledge across groups
        health_sum += np.mean(state[1::4])       # Mean of health across groups
        innovation_sum += np.mean(state[2::4])   # Mean of innovation across groups
        potential_sum += np.mean(state[3::4])    # Mean of potential across groups
        
        # Calculate composite productivity as an average of knowledge, health, and innovation
        composite_productivity_sum += np.mean((state[::4] + state[1::4] + state[2::4]) / 3)

        if done:
            break

    # Track average values over the episode
    knowledge_history.append(knowledge_sum / max_steps)
    health_history.append(health_sum / max_steps)
    innovation_history.append(innovation_sum / max_steps)
    potential_history.append(potential_sum / max_steps)
    composite_productivity_history.append(composite_productivity_sum / max_steps)

    reward_history.append(total_reward)

    if episode % 10 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}")
print("Training completed.")

# Plot reward evolution
plt.figure(figsize=(12, 8))

plt.subplot(2, 3, 1)
plt.plot(knowledge_history, label='Knowledge (Productivity)')
plt.xlabel('Episode')
plt.ylabel('Average Knowledge')
plt.title('Knowledge Evolution Over Time')
plt.grid(True)

plt.subplot(2, 3, 2)
plt.plot(health_history, label='Health')
plt.xlabel('Episode')
plt.ylabel('Average Health')
plt.title('Health Evolution Over Time')
plt.grid(True)

plt.subplot(2, 3, 3)
plt.plot(innovation_history, label='Innovation')
plt.xlabel('Episode')
plt.ylabel('Average Innovation')
plt.title('Innovation Evolution Over Time')
plt.grid(True)

plt.subplot(2, 3, 4)
plt.plot(potential_history, label='Potential')
plt.xlabel('Episode')
plt.ylabel('Average Potential Private Consumption')
plt.title('Potential Private Consumption Evolution Over Time')
plt.grid(True)

plt.subplot(2, 3, 5)
plt.plot(composite_productivity_history, label='Composite Productivity')
plt.xlabel('Episode')
plt.ylabel('Composite Productivity')
plt.title('Composite Productivity Over Time')
plt.grid(True)

plt.tight_layout()
plt.show()
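
Once training has finished, the learned policy can be queried directly: the deterministic output of the actor is the recommended allocation for a given state. A minimal sketch (it assumes the env and agent objects from the script above are still in scope):

# Sketch: inspect the learned policy after training
state = env.reset().flatten()
action = agent.act(state)   # [education, health, innovation]; zero-sum, each in [-1, 1]
print("Recommended allocation (education, health, innovation):", np.round(action, 3))

# Roll the learned policy forward a few steps to see how the group states evolve under it
for _ in range(5):
    next_state, reward, done, _ = env.step(agent.act(state))
    state = next_state.flatten()
    print("Group states [knowledge, health, innovation, potential]:")
    print(np.round(next_state, 3))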

Technical Explanation of the Reinforcement Learning Process

Neural Network Architecture

In this approach, a Deep Deterministic Policy Gradient (DDPG) agent was utilized to optimize the allocation of resources among different groups in the productivity environment. The Actor-Critic framework was implemented, where two neural networks are used:

  1. Actor Network: This network selects the actions (investments in education, health, and innovation). It outputs a continuous action vector in which each component is constrained to [-1, 1] and the components are scaled so that they sum to zero, ensuring balanced resource allocation across the dimensions.
  • Architecture:
    • Input: the flattened state (3 groups x 4 dimensions = 12 values)
    • Two hidden layers: 128 neurons each
    • Output layer: 3 neurons corresponding to the actions (education, health, innovation)
    • Activation functions: ReLU in the hidden layers provides non-linearity; the output layer is linear, then mean-centered and rescaled so the actions stay within [-1, 1] and sum to zero.
  • Objective: The Actor network seeks to maximize long-term productivity through actions that influence knowledge, health, and innovation for the three groups.
  2. Critic Network: This network evaluates the actions chosen by the actor. It takes both the state (the current values for knowledge, health, innovation, and potential) and the chosen action as inputs and outputs a scalar estimate of the expected return (Q-value).
  • Architecture:
    • Input: the flattened state concatenated with the action (12 + 3 = 15 values)
    • Two hidden layers: 128 neurons each
    • Output layer: 1 neuron corresponding to the predicted Q-value.
    • Activation functions: ReLU activation for the hidden layers ensures the non-linearity of reward estimation.
  • Objective: The Critic network helps guide the Actor network by providing feedback on how good the chosen actions are in terms of achieving balanced productivity growth.

Key Parameters Used

  1. Learning Rates:
  • Actor Network: lr_actor=0.001. A low learning rate is used to ensure that the policy changes smoothly and does not fluctuate too aggressively.
  • Critic Network: lr_critic=0.001. Similar to the actor network, the critic’s learning rate is kept low to stabilize learning.
  2. Discount Factor (Gamma): gamma=0.99. This discount factor emphasizes long-term rewards, ensuring the agent makes decisions that benefit productivity growth over time, not just in the immediate steps.
  3. Soft Update (Tau): tau=0.005. Soft updating ensures that the target networks (which are used for stability in learning) slowly track the main networks by blending the previous parameters with the updated parameters.
  4. Reward Structure:
  • Improvement: The primary reward is based on the improvement in productivity across knowledge, health, and innovation.
  • Balanced Growth: If all dimensions grow in a balanced manner, an additional reward is granted to incentivize even development.
  • Penalties: Penalties are applied if there is significant imbalance between the different dimensions (e.g., if knowledge grows too fast but health or innovation lag behind). There is also a slight penalty for extreme actions to encourage smooth, controlled decisions.
  5. Episodes: Training was carried out over 500 episodes, with each episode lasting up to 50 steps. This gives the agent enough time to explore different action strategies and optimize its policy.

Outcomes

The plots produced by the training script above depict the outcomes of this reinforcement learning process over the 500 episodes.

  1. Knowledge, Health, and Innovation Growth:
  • All three dimensions (knowledge, health, and innovation) start from moderate levels and, over time, approach their maximum levels (with knowledge and health plateauing around 2.5–3.0).
  • Health and Innovation: There were initial drops in health and innovation, but the agent quickly learned to recover these values and sustain growth.
  • Knowledge: There is some fluctuation early in training, but knowledge quickly stabilizes and grows steadily over time.
  2. Potential Private Consumption:
  • This metric represents the overall consumption capability that each group has. Initially, there are high fluctuations, indicating instability in how potential was managed. Over time, this stabilizes around a steady value, although with continued fluctuations, as potential is influenced by complex interactions between public investment (taxes) and private consumption.
  3. Composite Productivity:
  • Composite productivity, which is an aggregate measure of the three main dimensions (knowledge, health, innovation), shows strong growth early on and eventually stabilizes at a high value (around 2.5). This demonstrates that the agent successfully learned to balance improvements across all dimensions.

Observations

  1. Balanced Growth: The agent successfully optimized resource allocation across the three dimensions, achieving steady and balanced growth after initial fluctuations. The penalty mechanism for imbalance helped ensure that no single dimension grew disproportionately to the others.
  2. Fluctuations in Potential Consumption: The private consumption potential shows more variability than the other dimensions, likely because it is tied to the interplay between public investments and consumption. This suggests that further fine-tuning could help stabilize this aspect.
  3. Smooth Scaling of Actions: The reinforcement learning agent effectively balanced its actions to avoid extreme swings, as evidenced by the steady improvements across most dimensions. The penalties for extreme actions discouraged sudden shifts in resource allocation.

Individual and Sub-group Policy Optimization: A New Paradigm

In addition to demonstrating how reinforcement learning can be used to optimize policies for groups, this simulation also shows that creating individual or sub-group policies is not only feasible but also advantageous. The learning system we’ve implemented demonstrates that policies can be fine-tuned to cater to specific groups—such as the least productive, average productive, and most productive segments—rather than applying a broad, one-size-fits-all strategy. Here’s why this approach is superior to traditional centralized policies:

  1. Tailored Policy for Each Group:
    Unlike centralized economic policies that may apply the same rules across vastly different population segments, individual or sub-group policies can adapt to the unique needs and characteristics of each group. For instance, in the current simulation, we observe that the least productive individuals benefit from higher rates of growth, while the most productive individuals require a different type of steady support. This type of targeted intervention allows for more efficient use of resources, where policies are applied based on specific growth potentials and needs.
  2. Real-time Adaptation to Dynamic Changes:
    Traditional, centralized economic policies are often static or updated only occasionally in response to macroeconomic trends. These policies typically rely on high-level data and are slow to react to changes at the micro (individual or sub-group) level. With reinforcement learning systems, policy adaptations can happen in real-time, meaning that when certain groups face sudden challenges or unexpected improvements (e.g., due to a health crisis or technological advancement), the system can respond immediately to reallocate resources, adjust taxation, or modify investment strategies. This dynamic adaptability ensures continuous optimization without the delays associated with centralized decision-making.
  3. Addressing Local Variability:
    Centralized policies often fail to account for the local variability present across different regions or population segments. Different groups within a population may face varying levels of health challenges, educational access, or innovation opportunities, which cannot be captured effectively by a one-size-fits-all approach. In contrast, individualized policies consider the unique attributes of each group (e.g., innovation capacity, health status, potential). This local optimization is crucial for tackling disparities, as it ensures that resources are directed to the areas where they can have the most significant impact.
  4. Fairer Distribution of Resources:
    By optimizing policies on a per-group basis, reinforcement learning ensures that resources are distributed fairly. The simulation demonstrates that resource allocation is not only based on absolute metrics like health or productivity but also considers marginal improvements—meaning that smaller groups with greater potential for improvement are not overlooked. This helps prevent systemic imbalances that can arise when centralized policies favor only the most productive regions or sectors.
  5. Continuous Learning and Improvement:
    Unlike static policies that are reviewed only occasionally, reinforcement learning-based policy systems are constantly learning and improving. This means that even as population dynamics shift, technological changes occur, or external shocks happen, the system adjusts its policies accordingly. Centralized policies, by contrast, require significant time and bureaucracy to shift, often causing policies to be outdated by the time they are implemented.
  6. Increased Efficiency in Public Spending:
    Targeted policies also result in increased efficiency in the allocation of public funds. By learning which groups benefit most from certain interventions, the system avoids wasting resources on ineffective or unnecessary spending. This is particularly relevant for economic planning, where the challenge is often to do more with less. The system we’ve built shows that by focusing on group-specific needs—such as education investments in less productive groups or innovation incentives for highly productive groups—public spending can have a greater long-term impact on overall societal welfare.

Why This is Better Than Centralized and Occasional Policies

The traditional approach to economic policy often involves broad interventions that attempt to manage the entire population with a single set of rules. These policies may be applied only after lengthy legislative processes and are typically reactive, rather than proactive. The disadvantages of such centralized systems include:

  1. Inefficiency: Centralized policies are less responsive to the changing needs of sub-groups and individuals. A policy designed for the entire population may be inefficient for specific groups, leading to a misallocation of resources.
  2. Lag in Response: Centralized policies are slow to change. They typically require significant lead time to adjust to new circumstances. In contrast, a reinforcement learning-driven system can adjust policies continuously in real-time.
  3. Lack of Precision: Generalized policies tend to overlook smaller groups or focus only on averages, neglecting the diversity within a population. This can lead to large segments of the population being underserved, further deepening inequalities.

Conclusion: A New Standard for Policy Optimization

In the future, as more data becomes available and computational capabilities continue to expand, individualized or sub-group policy learning systems like the one developed here could revolutionize economic and social planning. By dynamically adapting to the needs of various segments of the population, we can create smarter, more efficient, and more equitable policies that lift productivity, health, and innovation simultaneously across all levels of society.

The potential for reinforcement learning to provide a scalable, real-time, and highly adaptive system marks a significant departure from the traditional model of centralized, one-size-fits-all policymaking. In a world where populations are increasingly diverse and economic challenges are complex and fast-moving, the ability to personalize and optimize public policy may well become the new standard for achieving sustainable societal development.

Technical Appendix

Credits: Formulation / Ideation from Stefano Ciccarelli
