Forecasting electric power demand in the major Texas cities using new data science approaches.

Project by: Aarushi Ramesh, Neeti Shah, Ryan Root, and Anirudh Anasuri

Transmission lines carrying electric power to homes, buildings, and other premises.


The power grid is a complex system that supplies electricity to customers across the country. It consists of many parts involved in power generation, transmission, and distribution. A critical aspect of the electric power grid is balancing the supply and demand of electric power at all times. If demand for electricity rises while the energy available for generation is low, the grid frequency drops; power production halts, which can lead to a blackout. If demand for electricity drops while plenty of power is available, the frequency rises; because power plants can only tolerate a narrow frequency band, they may halt generation and disconnect from the grid, again leading to a potential blackout.

Winter Storm Uri had widespread impacts across many U.S. states and North America. The severe cold had drastic consequences for electricity generation in Texas: with a huge increase in demand for power and a limited supply of electricity, rolling outages had to be imposed statewide. This was not the only case of an imbalance between electricity supply and demand; similar grid failures and forced power outages occurred during major heat waves and lightning storms in New York in the summers of 2019 and 1977.

These events show the importance of accurately predicting and forecasting electric load from factors such as weather, geography, and population. We propose using efficient predictive analytics and newer data mining approaches to forecast power demand and increase the efficiency of the power grid.

Introduction & Background

Think back to the last time you went for a walk or explored the outdoors. Did you spot part of the power grid system along the way? Perhaps you saw transmission lines carrying electricity to nearby premises, or a transformer or substation that steps voltage up or down. These components make up the infrastructure of our power grid.

The electric power grid system is split into several stages, beginning with power generation. Electric power is then carried over high-voltage transmission lines to distribution lines near neighborhoods and homes. The key requirement of the power grid is stability: the electricity supply needs to meet the electricity demand at all times.

Power Grid Diagram

In other words, if there is an imbalance in the power grid system, blackouts in local or widespread areas can occur. In order to combat the risks of a system imbalance, we are seeking to explore newer machine learning models and techniques to accurately forecast power demand on a daily basis for regions in Texas.

Problem Being Addressed

Our goal is to accurately predict and forecast the power load on a daily basis from a set of crucial factors, so we can know proactively when the balance between electricity demand and supply is on the verge of destabilizing the system. To find an efficient solution to this problem, we created a plan.

To train a machine learning model and predict future values, we first need data! ERCOT, the Electric Reliability Council of Texas, is the organization that oversees and operates the majority of Texas’ power grid. ERCOT divides the state of Texas into 8 weather regions, each grouping areas with similar climate characteristics. ERCOT’s region dataset records power demand (in megawatt-hours) for every region on an hourly basis. Although ERCOT runs powerful and accurate linear regression models to forecast demand, we wanted to explore ERCOT’s region data with different time series forecasting models and techniques. The graph below plots ERCOT’s hourly data from January to March 2021.

The blue dashed line at the top of the graph marks the February winter storm period. From the plot, we can see demand peaks occurring during the week. The North Central region reaches its highest peak (around 25,000 MWh) during the storm. During the storm period, demand drops immediately after the peak for almost all regions; we infer this may be due to the rolling blackouts imposed to balance supply and demand. Our goal is to develop a model that takes the important features into account and outputs a forecast within a small range of the true value.

Once we collect our data and perform preprocessing and EDA, we will explore different approaches to time series forecasting. The models we will test are multiple linear regression, linear regression with polynomial feature transforms, XGBoost, random forests, CatBoost, and LightGBM.

Our Process

Data Collection and Preparation

For our data collection step, we had our target variable: the power demand in a Texas region from ERCOT’s dataset. However, we still required a set of input features. After extensive research, and given how little dependable, detailed weather data is available online, we settled on the following features (gathered from various sources):

  • Date (Year, Month, Day)
  • Temperature (in F)
  • Wind (Avg MPH)
  • Precipitation (Inches)
  • Population (by Year)

Since Texas covers a large geographic area (reflected in the breadth of the ERCOT regions), obtaining reliable weather records around the regions of interest, in particular the major Texas cities, is valuable for forecasting future power demand. The three main cities we will look at are Austin (South Central region), Houston (Coastal region), and Dallas (North Central region).

ERCOT Texas Weather Regions
Our Dataset for the Austin/South Central ERCOT region: SCENT = South Central Demand

Data Preprocessing

Data Normalization

To preprocess the data, we implemented several techniques. First, we dealt with missing values: we replaced NaN precipitation values with 0 and replaced missing wind data with the column average. We considered simply deleting these rows, since we did not want to add bias to our dataset; however, when testing the resulting models, imputation produced better accuracy than deletion.
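As a sketch, the imputation step might look like the following (the column names and values here are illustrative, not the exact ones from our dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one region's merged weather data.
df = pd.DataFrame({
    "precip_in": [0.1, np.nan, 0.0, 0.4],
    "wind_mph":  [8.0, 12.0, np.nan, 10.0],
})

# Missing precipitation -> assume no rain was recorded.
df["precip_in"] = df["precip_in"].fillna(0.0)
# Missing wind -> fill with the column mean.
df["wind_mph"] = df["wind_mph"].fillna(df["wind_mph"].mean())
```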

The next challenge was normalizing all feature values. Since our data was pooled from various sources, the date variable in particular arrived in many formats: some sources used a TimeStamp object, others simple date strings with slashes as delimiters, strings with dashes, and even strings with month names spelled out. We reformatted all of these into a single “DD/MM/YYYY” Date column and created three derived features, “Month”, “Day”, and “Year”, for easy access. To normalize the numeric features, we used min-max feature scaling so our models could train efficiently. This is a crucial preprocessing step because our feature variables had very different ranges (for instance, the electricity demand target was in the megawatt-hours range, whereas the population was in the millions).
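A minimal sketch of the date normalization and min-max scaling in plain Python; the format strings are examples of the kinds of inputs we encountered, and the demand values are placeholders:

```python
from datetime import datetime

# Example source formats: slashes, dashes, and spelled-out months.
FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%B %d, %Y"]

def parse_date(raw: str) -> datetime:
    """Try each known source format until one parses."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def min_max_scale(values):
    """Rescale a feature column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

d = parse_date("2021-02-15")
month, day, year = d.month, d.day, d.year        # the three derived features
demand_scaled = min_max_scale([52000, 61000, 74000])
```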

Lastly, we one-hot encoded our only categorical feature, the date, applying the encoding to the Month, Day, and Year columns.
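An illustrative one-hot helper for, say, the Month feature (the function name and usage are our own illustration, not our exact pipeline code):

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

months = list(range(1, 13))   # 1..12
feb = one_hot(2, months)      # 12-element vector, hot at February
```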

Dimensionality Reduction using PCA

Our dataset was multidimensional, consisting of Year, Date, Month, Precipitation, Population, Wind, Temperature, and the target. We wanted to capture most of the variation and information in the dataset in a smaller set of dimensions, to speed up training and make it simpler and more efficient, so we used the Principal Component Analysis (PCA) technique. PCA finds the “principal components”: orthogonal directions in feature space that maximize the variance captured, so that as much information as possible is retained in few dimensions. We used 4 principal components and applied PCA after normalization.
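A sketch of PCA via the SVD of the centered feature matrix, keeping 4 components as we did; the data here is a random placeholder, not our dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))      # 7 features: date parts + weather + population

Xc = X - X.mean(axis=0)            # center (done after min-max scaling in our pipeline)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:4]                # top-4 principal directions
X_reduced = Xc @ components.T      # project the data onto 4 components

explained = (S ** 2) / (S ** 2).sum()   # variance ratio per component, descending
```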

We then plotted the first two principal components, which capture the most information and variance in our data, and color coded the points by the range of the target variable (megawatt-hours, normalized from 0–1). Red points represent normalized values from 0–0.2, blue points from 0.2–0.5, green points from 0.5–0.8, and yellow points from 0.8–1.

From this graph, we can see that the red points, representing lower electricity consumption, cluster near the top left of the plot, while the blue and green points, representing greater consumption, sit toward the bottom right. Since we are dealing with continuous time variables, visualizing with PCA gets trickier in this case; still, there are slight patterns to be seen.


Multiple Linear Regression

This model was used as a baseline, both to set a reference accuracy and to test whether energy demand can be modeled as a linear function of the features. The model’s low score led us to believe that it cannot. The residual plot outlined below confirms that a linear fit is not a good representation of the data.

  • Score: 0.24902846081911234
  • Coefficients: array([[ 0.01292622, -0.23920232, -0.08551818, -0.53612113]])
  • Intercept: 0.24902846081911234
Regression analysis for South Central Region
Residual Plot for South Central
Residual Plot for North Central

The plot above on the left is a histogram of the residuals, i.e., the differences between the predicted and true values in normalized megawatt-hours. Residual histograms provide meaningful information about a model: ideally, residuals should look like random noise centered at zero, since any systematic pattern means the model failed to capture predictable structure. One can fully trust a model’s coefficients only if the residual plot shows no explicit pattern or skewed distribution.

These residual plots indicate that this model does not perform well: the residuals are skewed rather than randomly distributed. This suggests that the model is not well generalized and struggles to obtain a good fit.

Polynomial Feature Transform in Linear Regression

After analyzing the previous residual plots and the time series graph of four years of electricity demand in the South Central region, we realized our data contains non-linearities that a linear model cannot capture well. Since multiple linear regression was not the best fit, we decided to transform our feature variables into higher-degree terms to better represent the non-linear data. This is called a Polynomial Feature Transformation.
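Conceptually, the transform expands each feature into higher-degree terms before fitting an ordinary linear model. A minimal sketch with scikit-learn, fit on synthetic cubic data rather than our dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(300, 1))
# Cubic relationship plus a little noise.
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.1, size=300)

# degree=3 expands x into [1, x, x^2, x^3] before the linear fit.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
score = model.score(X, y)   # R^2 on the training data
```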

Time Series Graph of Electricity Demand
Polynomial Transformation Function
Residual Plot of South Central, after Polynomial Transform

The residual histogram in this case is close to a Gaussian distribution: the values are mostly centered around zero, and the shape of the plot resembles a bell curve.

This indicates that our model is more generalized and captures the non-linearities presented in our dataset very well.

  • Score: 0.7229388449348859, with degree of 3
  • Score: 0.7311397116358878, with degree of 5

The score of the model is around 0.722 with a degree of 3, which indicates that this model performs well and provides a far better fit than the linear baseline.


CatBoost

CatBoost was chosen as a viable model for this dataset because of its flexibility in dealing with categorical variables, particularly here where each date had to be one-hot encoded. When trying out various hyperparameters, the ones that yielded the best results were depth=2, learning_rate=1, and iterations=200. The depth is the depth of each tree, the learning rate is the speed at which the model learns, and the iterations set the number of trees built in the process. Changing the iterations affected the score the most: initially, iterations was set to 2, and as we tuned it upward, the score increased from 0.82 to roughly 0.9327. Such a drastic improvement makes sense; with only a couple of small trees, the model has too little capacity and underfits. Beyond 200 trees, the model began to overfit and yielded lower scores.

Random Forest

Random forest was another model we chose, because of its reliability in controlling overfitting. The random forest method fits a number of decision trees on different sub-samples of the dataset and averages their predictions, which helps control overfitting. Hyperparameter tuning was done in a similar manner to CatBoost, with some differences: the key hyperparameters affecting accuracy were the number of estimators and the max_depth. Unlike in the CatBoost model, the max_depth played the largest role in improving accuracy. Raising the depth from 2 to 20 increased the score from around 0.82 to 0.93. From our observations, depth matters so much because deeper trees allow finer-grained splits of the data. With very little depth, the random forest trees were underfitting; above a depth of 20, the model began to overfit.
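The depth sweep can be sketched as a simple loop; the data is synthetic, so the exact scores are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
# Placeholder non-linear target.
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

scores = {}
for depth in (2, 5, 10, 20):
    model = RandomForestRegressor(n_estimators=100, max_depth=depth, random_state=0)
    model.fit(X, y)
    scores[depth] = model.score(X, y)   # training R^2 rises with depth
```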


LightGBM

LightGBM is another boosting model we used to help understand the relationship between the features and the overall energy demand. We chose it over XGBoost because, while XGBoost may deliver slightly better accuracy, the extra resources required to get those slightly better results were not worth the tradeoff: XGBoost takes significantly longer to train than LightGBM, and LightGBM’s results were great as well.

This was the model that required the least hyperparameter tuning, as the default parameters already yielded a score of almost 0.92. However, as with the random forest, the max_depth parameter affected the model’s accuracy, and the num_leaves parameter was crucial in determining the best score. num_leaves matters because LightGBM grows its trees leaf-wise, compared to the depth-wise growth in other models. Setting num_leaves as high as 2^max_depth (the depth-wise maximum) can actually lead to overfitting, and the number of leaves does not have a simply positive or negative relationship with the score.

Attached below are some attempts at getting the best accuracy while trying various numbers for the num_leaves:

  • Score: 0.9330842377062655, with num_leaves = 20
  • Score: 0.93322597305019882, with num_leaves = 9
  • Score: 0.9335116249062436, with num_leaves = 8
  • Score: 0.9314792436282318, with num_leaves = 5

As you can see from these trials (and many more), the num_leaves that gave the best score was 8; using a number of leaves close to the total possible leaves in the tree (in this case 32) leads to overfitting and an overall decreased score.

Evaluation and Result(s)

The results are shown below. These are the electric power demand predictions for October 29th, 2015 (a randomly chosen date, used as common ground to test every model on).

Results from different models


Looking at the results for this one day, our models produce predictions roughly 100 to 2,000 MWh away from the true value on average. These predictions still capture an overall estimate of what the true demand will look like. An interesting observation: for the South Central region, all of the models predicted a value higher than the true demand, while for the other two regions the predictions fell short of the true value. A precise, accurate demand prediction is extremely important, but producing a reliable range of values is also crucial. There is a trade-off here: as model complexity increases, we may get results closer to the true value for some regions or data points, but that accuracy holds only for a few areas, so a complex model is often not the best fit across lots of data. A more generalized model, on the other hand, sits farther from the true value, but its error range stays similar across different data points and regions. In short, there is a trade-off between the accuracy and the complexity of a model.

Conclusion and Lessons Learned

While we may have been overzealous in what we set out to accomplish (we initially wanted to compute a forecast for the entire state of Texas!) given the limited time, we were successful in reaching our target in select regions, each containing a major Texas city. Focusing on these regions prevented us from spreading ourselves too thin, and enabled us to go into more depth by considering additional and more diverse data inputs.

Furthermore, we learned the extent to which data acquisition plays a role in the ability to create and build accurate models. While we had difficulty obtaining reliable datasets, it also paved the way for us to create our own, which although limited, acts as a glimpse into what may be added going forward.

Feature Engineering Conclusion

As per the chart, feature 0 represents the temperature, feature 1 the average wind speed, feature 2 the precipitation, and feature 3 the population. Unsurprisingly, temperature ended up being the most important feature: when the temperature is low, more people turn on the heater, and the opposite holds true, since higher temperatures lead to more air conditioning units being powered. Temperature being the dominant driver of power consumption was therefore no surprise.
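Tree models expose these importances directly. A sketch with a scikit-learn random forest on synthetic data, where by construction a temperature-like feature 0 drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
temp = rng.uniform(20, 100, size=400)      # feature 0: temperature (°F)
other = rng.normal(size=(400, 3))          # stand-ins for wind, precip, population
X = np.column_stack([temp, other])
# Demand rises as temperature moves away from a comfortable ~65°F.
y = np.abs(temp - 65) + rng.normal(scale=0.5, size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_   # sums to 1; feature 0 dominates here
```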

Future Work

For future work, there are several things we would like to implement given more time. First we would collect and merge more data to create a more robust and expansive model.

For one, we would add the cost of electricity, as well as some metric for the health of the economy, as additional features. If the population as a whole has less money, people would generally be much more reserved in their electricity usage. We actually found a reliable dataset for the cost per kWh, but were not able to easily integrate it into our data frame due to conflicting file formats. This data, together with our population feature, would allow a more insightful analysis of the socio-economic influences on electricity usage.

Additionally, we would expand the weather datasets to incorporate more cities, such as Dallas and Houston, to reflect different ERCOT regions. Each has vastly different weather patterns, which greatly impact electricity consumption. Having more such data would make the model even more accurate and useful for people outside of Austin.

Next, we would also integrate Auto-Regressive Integrated Moving Average (ARIMA) models. ARIMA is especially useful for forecasting because it models how a series evolves over time. Since all of our data points are associated with a date, ARIMA would be a great approach for analyzing seasonal trends and would provide further flexibility.

Potential Business Impact

Lastly, we would want to develop a more user-friendly interface. With our current models, the user has to manually input the weather forecast, as well as the future population size, to receive a prediction of the day’s energy consumption. It would be much easier if the user could simply input a date and receive the model’s estimate. An interactive portal could let businesses, homeowners, and other users explore different models and compare the MWh each one predicts. To build this, we could integrate data from The Climate Explorer, which provides daily future weather forecasts. We could then feed the forecast for the user’s requested date into the model and return the prediction, ultimately creating a fluid and simple interface.


Project/Code Links
