The potential of improving human productivity by providing healthy indoor environments has been a consistent interest in the building field for decades. This research field's long-standing challenge is to measure human productivity given the complex nature of office work. Previous studies have diversified productivity metrics, allowing greater flexibility in collecting human data; however, this diversity complicates the ability to combine productivity metrics from disparate studies within a meta-analysis. This study aims to categorize existing productivity metrics and statistically assess which categories show similar behavior when used to measure the impacts of indoor environmental quality (IEQ). The 106 productivity metrics compiled were grouped into six categories: neurobehavioral speed, accuracy, neurobehavioral response time, call handling time, self-reported productivity, and performance score. This study then set neurobehavioral speed as the baseline category, given its fit to the efficiency-based definition of productivity (i.e., output versus input), and conducted three statistical analyses with the other categories to evaluate their similarity. The results showed that the neurobehavioral response time, self-reported productivity, and call handling time categories had statistical similarity to neurobehavioral speed. This study contributes to creating a constructive research environment for future meta-analyses to understand which human productivity metrics can be combined with each other.
Healthy building certification systems have rapidly grown their project portfolios. This movement demonstrates the growing interest in creating healthy indoor environments in buildings, considering the value of occupants' health and productivity. From the employer's perspective, if employees could become more productive, it would help justify the upfront refurbishment costs or additional operational costs. The cost of employees to an employer accounts for 92% of building costs, while utility bills account for only 6%. This scale gap suggests a large opportunity for financial benefits from providing indoor environmental conditions that contribute to occupant productivity. Specifically, potential office worker productivity gains from improved indoor environments have been estimated at up to $230 billion (in 2016 U.S. dollars) at the national level.
The impacts on human productivity due to variations in indoor environments in office buildings have been a consistent research interest for several decades (e.g., Ref. ). Building systems are the main vehicle for facility managers to control the indoor environmental quality (IEQ) of their building, and therefore the physical comfort and well-being of its occupants. Understanding how IEQ contributes to human productivity is therefore important for determining which IEQ parameters to target and to what extent. For example, Gupta et al. demonstrated that task performance (numerical calculation and proofreading) was impacted by air temperature and carbon dioxide (CO2) concentration, and that occupants in naturally ventilated offices had a higher tolerance to IEQ variations. Kawamura et al. asked human subjects to perform three-digit multiplication tasks under different IEQ conditions and showed that subjects prioritized the thermal and acoustic environments and perceived that they performed better when satisfied with IEQ. Niemelä et al. used a computerized monitoring system to compute call center workers' productivity (the number of telephone communications divided by the active work time) under different IEQ conditions and found that average productivity decreased when the air temperature exceeded 25 °C.
Although previous studies demonstrated trends in human productivity under different environmental conditions, they used a variety of human productivity metrics, including office workers' task speed, reaction time, scores on cognitive performance tests, and self-reported productivity. The use of diverse human productivity categories in this research field allows flexibility in conducting individual studies, but it poses several challenges. First, each productivity metric accounts for a different aspect of performance (e.g., speed, accuracy). Second, meta-analyses are often conducted in this research field (e.g., Refs. [8–11]) due to the small participant sizes in individual studies and the difficulty of studying human subjects. Hence, when productivity data using different metrics are compiled, it is challenging to conduct a meaningful meta-analysis. Specifically, Seppänen et al. used productivity data from the literature with a wide range of metrics, from objectively reported work performance to cognitive performance test scores, to create a human productivity prediction model with respect to air temperature. Even though all of these metrics are important components of human productivity, they have different scales, units, and relationships to overall performance. Therefore, the results from such a model could be misleading without careful consideration. Ultimately, this research aims to address the question of whether different types of human productivity metrics can be combined directly.
To tackle these challenges, this study categorized the productivity metrics used in previous studies based on common attributes and units of measurement (hereinafter, productivity metric categories). The productivity metric categories were then compared statistically using summary statistics, graphical representation, and a pairwise t-test. This approach used the productivity metrics that measure the efficiency or speed (i.e., a ratio of output to input) of office work, categorized as neurobehavioral speed, as a baseline for comparing the other productivity metric categories. An example of an efficiency-based productivity metric is the rate at which a cognitive task, such as a visual learning module, is completed, measured in units completed per hour. Examples of other productivity metric categories are accuracy and performance score (more details in the Methodology section). If statistical similarity is found between two productivity datasets, there is some evidence supporting the ability to directly combine the productivity metrics in these categories within a human productivity meta-analysis. If there are incongruencies in the statistical behavior of two categories, careful consideration is needed before combining them in research studies. In summary, this study aims to evaluate the statistical similarity of the data from various productivity metrics, contributing to a constructive research environment for future meta-analyses in this field.
This paper is structured as follows. The Literature Review section provides a summary of human productivity metrics used in previous studies and meta-analysis efforts on office performance influenced by IEQ. The Methodology section details the steps taken to evaluate productivity metrics in this study, including how the data from existing productivity and IEQ studies were collected, how these data were processed in preparation for a meta-analysis, and an explanation of the statistical methods used to analyze the data. The Results section presents a detailed list of literature studies compiled to support this study, the results of the statistical tests conducted, and a discussion of the findings. In the Conclusion section, the limitations and the future directions of this research are discussed.
Office work spans diverse contexts ranging from manufacturing-based to knowledge-based. This diversity adds complexity because many office workers perform different types of tasks, requiring different skills, on a daily basis. Ultimately, analyzing an individual office worker's productivity becomes challenging.
Studies have tried to address this challenge by focusing on easily quantifiable office work or by diversifying productivity metrics. For example, many field studies observed the call handling time of call center representatives [7,14] or nurses, given the ease of collecting these data. In laboratory-based studies, cognitive performance tests have been widely used, given the association of cognitive function with office work. Even though this approach does not directly measure office work in a field setting, it offers a simulated situation where participants' productivity can be quantified and external conditions can be controlled. The tests used to measure cognitive performance include numerical calculation, typewriting, memory (e.g., word, number, image), and reasoning tests (e.g., numerical, alphabetical, conditional, spatial) [15–17]. Studies have measured speed, response time, and/or accuracy as the human productivity metrics in such tests. Other studies utilized self-reported productivity, a subjective metric. The National Aeronautics and Space Administration (NASA) Task Load Index (TLX) has been used in a number of studies [19,20]. In this index, participants report their productivity on a 20-level Likert scale (from perfect to failure). Similar questions were posed in other studies [21,22].
To take a holistic perspective on office work, it is important to look at a variety of productivity metrics; however, as discussed above, combining metrics may confound the results in a meta-analysis. For example, response time, speed, score, and accuracy are widely used metrics from cognitive performance tests. Each metric shows an important aspect of productivity with regard to IEQ, but their scales and units might not align in a way that makes them directly compatible with each other for an overarching productivity metric. As attempts to mitigate this issue, some studies devised new metrics combining accuracy and response time [11,14], normalized the values measured in each metric as a percent improvement, or applied weights to the different types of tasks based on their applicability to general office work [12,23]. However, such attempts did not quantitatively analyze whether the various metrics can appropriately be joined.
This challenge has not been addressed in any previous meta-analyses. Seppänen et al.  collected studies that correlated ventilation rate to office productivity and created a regression model. This study weighted various productivity metrics based on their relative relevance to real work using the authors' judgment. A similar approach was taken in Ref. . Some studies directly combined existing data regardless of productivity metrics without discussing their compatibility [24,25]. Another research study compiled existing regression models relating thermal comfort and air temperature to productivity decrement to recreate a model that can be adapted to various work tasks . This study also did not discuss the suitability of the productivity metrics themselves to be combined directly within a single model.
Hence, this research aims to support future studies where human productivity data are measured with diverse metrics that are simultaneously combined to gauge the impacts of IEQ in office environments. This research will help inform which types of data should be combined and which types of data require caution before combining with other types.
To collect articles that investigated the impact of IEQ on human productivity in office buildings, we searched literature databases such as ScienceDirect, Wiley, and Google Scholar for peer-reviewed publications. The following keywords were used individually and in combination: office, productivity, performance, cognitive score, self-reported productivity, IEQ, thermal comfort, temperature, ventilation, CO2, lighting, and horizontal illuminance. The title and abstract of each study were then read to determine whether it aligned with the research interests. In addition, existing literature review papers and meta-analyses (e.g., Ref. ) were used to identify further studies cited in those publications and build an extensive collection of studies.
The criteria for including a study in this literature collection were as follows: (1) it measured a specific change to an IEQ parameter (studies of general building retrofits (e.g., Ref. ) were excluded); (2) it defined the productivity metric(s); and (3) it recorded the impacts. Regression results from meta-analysis studies were not included; rather, the original studies were identified and reported directly in this collection. The IEQ metrics most available in literature studies examining productivity and performance variation were CO2, ventilation, thermal comfort, and horizontal illuminance. These IEQ metrics are easily quantifiable in buildings and cover a wide range of IEQ categories (indoor air quality, thermal comfort, and lighting). In the end, the literature collection included 32 studies.
After compiling the literature collection, the number of data points, number of subjects, study duration, and number of unique productivity metrics were extracted. The number of data points refers to the number of unique changes in IEQ conditions multiplied by the number of productivity metrics recorded in the study. For example, if a study changed CO2 levels from 600 ppm to 1000 ppm and recorded three productivity metrics at each condition, that would be three testing data points that could be analyzed individually. If the study tested three CO2 conditions (e.g., 600 ppm, 1000 ppm, and 1500 ppm) and recorded three productivity metrics at each condition, that would be six data points (one of the CO2 conditions is the baseline and the other two can be the experimental conditions). Note that the number of subjects and duration recorded refer to each data point, not the entire study. For example, if the subject population spent one week at 1500 ppm CO2 and 1 week at 700 ppm CO2, the duration recorded would be 1 week (7 days) even though the entire study lasted 2 weeks. Studies that took place in less than 1 day (i.e., several hours) were recorded as 1 day.
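The counting rule above can be sketched in code. This is an illustration only; the function name and the example study structures are ours, not from the paper.

```python
# Sketch of the data-point counting rule described above: each non-baseline
# IEQ condition contributes one data point per productivity metric recorded.

def count_data_points(n_conditions: int, n_metrics: int) -> int:
    """One IEQ condition serves as the baseline; every other condition is
    compared against it, once per productivity metric."""
    n_experimental = n_conditions - 1
    return n_experimental * n_metrics

# Two CO2 conditions (600 and 1000 ppm), three metrics -> 3 data points.
print(count_data_points(2, 3))  # 3
# Three CO2 conditions (600, 1000, 1500 ppm), three metrics -> 6 data points.
print(count_data_points(3, 3))  # 6
```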
There were a number of confounding factors (e.g., outdoor air temperature, season, acoustic conditions) that could be controlled for within studies but not between studies. It was found that studies did not consistently report all IEQ conditions when they were not the independent variables (e.g., a study varied lighting levels, but did not collect CO2 or air temperature data). Therefore, the effects of different baseline IEQ and other conditions on the experimental parameters could not be included in this research.
In the literature review, some important caveats were identified in the IEQ parameters. In the case of CO2, some studies found a difference between CO2 artificially introduced into a space and human-produced CO2. The studies that measured both found that artificially introduced CO2 may not cause a significant decrement in productivity compared with human-produced CO2. It is suggested that CO2 itself is not a harmful pollutant at the levels typically found in buildings, but rather a proxy for human bioeffluents and other comorbid indoor air pollutants that are not being circulated out due to low ventilation. For that reason, artificially introduced and human-produced CO2 were treated as separate IEQ parameters.
In the case of thermal comfort, two prominent variables were used in studies: predicted mean vote (PMV) and ambient temperature. PMV is a more holistic variable, taking into account ambient and radiant temperature, relative humidity, airflow, clothing level, and metabolic rate. Eight of the 19 thermal comfort studies in our literature collection provided both PMV and temperature data and are included under both of these IEQ parameters. Some studies gave only temperature but included enough information to infer PMV: if a study included temperature and humidity values and at least a qualitative description of metabolic rate and clothing level, we calculated the PMV values using the Center for the Built Environment's Thermal Comfort Tool.
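The paper used the CBE Thermal Comfort Tool for this step. For illustration, the underlying Fanger PMV calculation (a transcription of the reference algorithm published in ISO 7730) can be sketched as follows; treat it as a sketch, not a validated substitute for the tool.

```python
import math

def pmv_iso7730(tdb, tr, vel, rh, met, clo, wme=0.0):
    """Predicted Mean Vote per the ISO 7730 reference algorithm.
    tdb: air temperature [C]; tr: mean radiant temperature [C];
    vel: relative air speed [m/s]; rh: relative humidity [%];
    met: metabolic rate [met]; clo: clothing insulation [clo];
    wme: external work [met], usually 0."""
    pa = rh * 10.0 * math.exp(16.6536 - 4030.183 / (tdb + 235.0))  # vapor pressure [Pa]
    icl = 0.155 * clo          # clothing insulation [m2.K/W]
    m = met * 58.15            # metabolic rate [W/m2]
    mw = m - wme * 58.15       # internal heat production [W/m2]
    fcl = 1.0 + 1.29 * icl if icl <= 0.078 else 1.05 + 0.645 * icl  # clothing area factor
    hcf = 12.1 * math.sqrt(vel)             # forced-convection coefficient
    taa, tra = tdb + 273.0, tr + 273.0
    # Iteratively solve for the clothing surface temperature.
    tcla = taa + (35.5 - tdb) / (3.5 * icl + 0.1)
    p1 = icl * fcl
    p2 = p1 * 3.96
    p3 = p1 * 100.0
    p4 = p1 * taa
    p5 = 308.7 - 0.028 * mw + p2 * (tra / 100.0) ** 4
    xn, xf = tcla / 100.0, tcla / 50.0
    while abs(xn - xf) > 0.00015:
        xf = (xf + xn) / 2.0
        hcn = 2.38 * abs(100.0 * xf - taa) ** 0.25   # natural convection
        hc = max(hcf, hcn)
        xn = (p5 + p4 * hc - p2 * xf ** 4) / (100.0 + p3 * hc)
    tcl = 100.0 * xn - 273.0
    # Heat-loss components [W/m2].
    hl1 = 3.05e-3 * (5733.0 - 6.99 * mw - pa)         # skin diffusion
    hl2 = 0.42 * (mw - 58.15) if mw > 58.15 else 0.0  # sweating
    hl3 = 1.7e-5 * m * (5867.0 - pa)                  # latent respiration
    hl4 = 0.0014 * m * (34.0 - tdb)                   # dry respiration
    hl5 = 3.96 * fcl * (xn ** 4 - (tra / 100.0) ** 4) # radiation
    hl6 = fcl * hc * (tcl - tdb)                      # convection
    ts = 0.303 * math.exp(-0.036 * m) + 0.028         # sensation transfer coefficient
    return ts * (mw - hl1 - hl2 - hl3 - hl4 - hl5 - hl6)

# ISO 7730 validation case: 22 C air and radiant temperature, 0.1 m/s,
# 60% RH, 1.2 met, 0.5 clo -> PMV close to -0.75.
print(round(pmv_iso7730(22, 22, 0.10, 60, 1.2, 0.5), 2))
```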
For the last step of preprocessing, the productivity metrics were grouped into categories based on contextual similarity in order to strengthen the statistical analyses. For example, time to complete an addition task and time to complete a subtraction task were grouped into one category. More details on the attributes used to group metrics are included in the Results section. As noted, the approach in this research prioritized the productivity metrics interpreted as efficiency (i.e., a ratio of output to input) as a basis of reference for comparing with the other categories.
Analyses of Productivity Metrics.
Before comparing the impacts of each productivity metric category, the relative productivity was calculated using Eq. (2). To maintain consistency in the positive and negative orientation of the results between IEQ parameters, the criteria for "worse" and "better" IEQ conditions were determined as follows:
The greater the CO2, the worse the IEQ.
The greater the ventilation rate, the better the IEQ.
The greater the horizontal illuminance, the better the IEQ.
If the units of P are considered better if they are less (e.g., number of errors, reaction time), then the value of RP is multiplied by −1 so that a positive value of RP corresponds to an improvement in productivity. If the units from the study are already in a percentage (e.g., percent errors, percent self-reported productivity), then the difference between the percentages is used instead of the percent change.
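The orientation rules above can be sketched in code. We assume here that Eq. (2) is the percent change in the productivity metric P from the "worse" to the "better" IEQ condition, consistent with the description above; the function signature is illustrative.

```python
# Sketch of the relative productivity (RP) orientation rules described above.
# Assumption: Eq. (2) is the percent change in metric P from the "worse"
# to the "better" IEQ condition.

def relative_productivity(p_worse: float, p_better: float,
                          lower_is_better: bool = False,
                          already_percent: bool = False) -> float:
    if already_percent:
        # Metrics already expressed as percentages (e.g., percent errors):
        # use the difference in percentage points, not the percent change.
        rp = p_better - p_worse
    else:
        rp = (p_better - p_worse) / p_worse * 100.0
    if lower_is_better:
        # e.g., number of errors or reaction time: a decrease is an
        # improvement, so flip the sign so that positive RP always means
        # higher productivity.
        rp = -rp
    return rp

# Reaction time drops from 500 ms to 450 ms under better IEQ -> +10%.
print(relative_productivity(500, 450, lower_is_better=True))  # 10.0
# Percent errors drop from 8% to 5% -> +3 percentage points.
print(relative_productivity(8, 5, lower_is_better=True, already_percent=True))  # 3.0
```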
Three statistical methods were used to compare the datasets: summary statistics (mean, quartiles, variance); a graphical representation of the density of values in each dataset; and a pairwise t-test. The summary statistics convey the characteristics of the datasets in an easily interpretable fashion, the density plots graphically represent the data and their distributions, and the t-tests provide a more definitive conclusion concerning the similarities of the datasets.
The R programming language was used to perform these analyses. The three methods were applied to (1) each productivity metric category and (2) each category separated by IEQ parameter. Throughout the remainder of this publication, these are referred to as the first analysis and the second analysis.
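The paper performed these steps in R; an equivalent sketch in Python (with synthetic data, using scipy) might look like the following. The group values are invented for illustration and are not the study's data.

```python
# Sketch of the comparison pipeline: Levene's test for equal variances,
# pairwise Welch t-tests, and Benjamini-Hochberg p-value adjustment.
# Synthetic data only; the paper's analysis was performed in R.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = {  # hypothetical relative-productivity samples per category
    "neurobehavioral speed": rng.normal(2.0, 3.0, 40),
    "self-reported productivity": rng.normal(2.5, 3.5, 30),
    "performance score": rng.normal(8.0, 12.0, 25),
}

# Step 1: Levene's test for equality of variances (the ANOVA assumption check).
levene_p = stats.levene(*groups.values()).pvalue

# Step 2: pairwise t-tests without the equal-variance assumption (Welch).
names = list(groups)
pairs, pvals = [], []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        res = stats.ttest_ind(groups[names[i]], groups[names[j]], equal_var=False)
        pairs.append((names[i], names[j]))
        pvals.append(res.pvalue)

# Step 3: Benjamini-Hochberg step-up adjustment of the p-values.
def benjamini_hochberg(pvalues):
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for pos in range(m - 1, -1, -1):  # walk from largest to smallest p-value
        k = order[pos]
        running_min = min(running_min, pvalues[k] * m / (pos + 1))
        adjusted[k] = running_min
    return adjusted

for pair, p_adj in zip(pairs, benjamini_hochberg(pvals)):
    print(pair, round(p_adj, 4))
```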
At the end of the Results section is a summary of the three methods of comparison for each productivity metric category. In this comparison, the summary statistics and density plots are assigned a rating on a six-point scale from very similar to very different based on qualitative observations of the data relative to the performance of the other categories. The t-test is assigned a rating on the same scale based on the p-value.
From the literature collection, a total of 106 productivity metrics were identified and grouped, based on common attributes and units of measurement, into six categories (Table 2). This categorization allowed for a greater sample size for the comparison of productivity metrics. Neurobehavioral speed most closely aligned with the definition of efficiency-based productivity described in the Introduction section, because its units were in output (tasks completed) per input (employees' time). Call handling time also aligned with efficiency-based productivity in terms of units (e.g., calls completed per hour), but it was kept as a separate category because, unlike the other metrics, it required interpersonal communication and was typically collected in field studies as opposed to laboratory studies. Neurobehavioral response time operates on a small timescale (seconds or milliseconds) compared with the neurobehavioral speed metrics (minutes, hours) and does not measure any type of work output. When human productivity was measured in a testing program with a score (e.g., on a scale of 10 or 100, as determined by the test creator), regardless of the focus area, it was placed in the performance score category. These tests included the Strategic Management Simulation (SMS) test (e.g., basic/applied/focused activity level, task orientation, crisis response information seeking/usage [59,60]) and cognitive performance tests such as digit-span memory, picture recognition, symbol-digit modalities, text typing, Tsai-Partington, creative thinking, executive function, and cognitive flexibility. Five of the 16 studies with performance score metrics used SMS [28,40,43,45,46].
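The grouping step can be illustrated as a simple lookup from metric to category. The metric names below are hypothetical examples, not the paper's full list of 106 metrics.

```python
# Illustrative sketch of assigning individual productivity metrics to the six
# categories; metric names are hypothetical examples.
CATEGORY_OF_METRIC = {
    "addition task completion time":   "neurobehavioral speed",
    "proofreading speed":              "neurobehavioral speed",
    "simple reaction time (ms)":       "neurobehavioral response time",
    "percent errors in typing":        "accuracy",
    "calls handled per hour":          "call handling time",
    "NASA-TLX self-rated performance": "self-reported productivity",
    "SMS focused activity level":      "performance score",
    "digit-span memory score":         "performance score",
}

def group_by_category(metrics):
    """Collect metrics into lists keyed by their productivity category."""
    groups = {}
    for metric in metrics:
        category = CATEGORY_OF_METRIC.get(metric, "uncategorized")
        groups.setdefault(category, []).append(metric)
    return groups

print(group_by_category(list(CATEGORY_OF_METRIC)))
```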
Statistical Evaluation of Productivity Metric Categories.
Table 3 shows the summary statistics of the datasets for each productivity metric category. The metrics included are the mean, variance, and quartile values after applying the weighting factors from Table 1.
The results indicated that neurobehavioral speed was most similar to neurobehavioral response time and somewhat similar to self-reported productivity and call handling time, considering primarily the means and also the percentiles and variance. Performance score was the least similar to speed, with a mean almost three times larger and a large variance. Figure 1 shows the same results graphically. The curve for performance score showed a wide distribution, skewed to the right and extending past 20% on the x-axis, where the graph is cut off. Accuracy showed a very narrow distribution compared with the other categories, while neurobehavioral response time, neurobehavioral speed, self-reported productivity, and call handling time were fairly similar in distribution.
Table 4 shows the unweighted p-value results of the pairwise t-test. The six productivity category datasets did not meet the equal-variance (homoscedasticity) assumption of a standard ANOVA (Levene's test p-value < 2.2e−16). Therefore, a pairwise t-test with no assumption of equal variance, with p-values adjusted by the Benjamini-Hochberg method, was conducted. Neurobehavioral speed did not show a significant difference from (i.e., showed similarity to) call handling time, neurobehavioral response time, and self-reported productivity.
Statistical Evaluation of Productivity Metric Category Behavior by Indoor Environmental Quality Parameter.
The second analysis divided the data by IEQ metric. Although this reduced the sample size, it allowed per-unit changes to productivity to be compared. Horizontal illuminance did not have enough data points to analyze individually and was not included in this section. Some of the productivity metric categories no longer had a robust sample size within certain IEQ metrics, and those categories were omitted from the analyses and discussion. Note that no conclusions are drawn in this research about the behavior of productivity metric categories within individual IEQ metrics; rather, these observations are compiled to draw insights about the productivity categories as a whole, outside of any specific IEQ metric.
Within human-produced CO2, ventilation, and PMV, the productivity category datasets did not meet the equal-variance (homoscedasticity) assumption of the standard ANOVA (Levene's test p-values < 4.6e−06, < 8.8e−05, and < 0.032, respectively). Therefore, a pairwise t-test with no assumption of equal variance, adjusted by the Benjamini-Hochberg method, was conducted, just as in the first analysis. Although the productivity categories within artificially introduced CO2 and ambient temperature met the equal-variance assumption of the standard ANOVA, we used the same pairwise t-test for consistency.
In Table 5, for human-produced CO2, call handling time and neurobehavioral speed had similar means but dissimilar quartiles. Call handling time had a relatively small sample size for human-produced CO2 and neurobehavioral response time had a small sample size in both tables. These categories were therefore excluded from the pairwise t-test and graphical representations in this section. No self-reported productivity data were identified for CO2. Accuracy and neurobehavioral response time had means very close to zero, suggesting there was not a noticeable relationship between CO2 and these productivity metrics. Performance score had a very large mean, suggesting it is not directly compatible with the other metrics. For artificially introduced CO2, there was also a large variance for performance scores (Table 6). This means that the various metrics within the performance score category may not be compatible even with each other.
Figure 2 shows the density distribution plots of productivity improvement results by productivity metric category for human-produced CO2 (Fig. 2(a)) and artificially introduced CO2 (Fig. 2(b)). The horizontal extents of the two plots were limited to −20 to +40%/1000 ppm to better show the behavior of the curves around the means, even though the performance score data extended farther. Narrow curves with a high spike signify cohesiveness in the performance impacts, whereas flat, wide curves indicate high variability in the results.
Call handling time and neurobehavioral response time were removed from Tables 7 and 8 due to small sample size. Table 7 shows that performance score had low similarity to all other datasets for human-produced CO2. All other datasets for human-produced CO2 and all datasets within artificially introduced CO2 were similar to each other, respectively.
Within ventilation rate, call handling time and neurobehavioral speed had similar means and similar quartiles, as shown in Table 9. Performance score had a much larger mean and percentiles than the other groupings. Call handling time and neurobehavioral response time had small sample sizes and were therefore excluded from the pairwise t-test and graph in this section. Figure 3 shows that performance score had a very wide distribution compared with neurobehavioral speed, and accuracy had a slightly narrower distribution.
In Table 10, all the categories included showed significant differences from each other.
In Table 11, for temperature, neurobehavioral speed aligned somewhat well with call handling time. In Table 12, for PMV, neurobehavioral speed did not align with any of the other groupings. Performance score had a negative mean, suggesting an inverse relationship to PMV.
Tables 13 and 14 show the results of the t-test for temperature and PMV, respectively. Call handling time did not show significant similarity to accuracy or neurobehavioral response time for temperature. Accuracy did not show significant similarity to neurobehavioral response time for PMV. The remaining categories presented did not show any significant differences from each other.
Summary of Results and Discussion.
Table 15 shows the summary of the results using neurobehavioral speed as the basis of reference. This table includes the summary statistics (focusing on the mean as the value for comparison), a comparison through graphical observation (focusing on the distribution of the data), and the pairwise t-test results. Each comparison in the table is given a rating: very similar, similar, somewhat similar, somewhat different, different, or very different. For the summary statistics and graphs, these ratings are assigned qualitatively based on the results presented. For the pairwise t-test, the rating is assigned based on the p-value.
Artificially introduced CO2, with the exception of performance score, had minimal impact on productivity results. This could be, as other studies have suggested, because CO2 is an indicator of comorbid indoor air pollutants and low ventilation rates and is not itself a harmful pollutant at the concentrations typically found in buildings. The data collected for this IEQ variable were therefore not used in the comparison of productivity metric categories in Table 15. Temperature and PMV were combined into thermal comfort for the summary because there is some overlap in the studies these data come from and because the results were generally similar.
Based on the summaries in Table 15, all of the p-values for call handling time, neurobehavioral response time, and self-reported productivity showed no significant differences when compared with neurobehavioral speed. In addition, the majority of the qualitative ratings based on the summary statistics and graphical comparisons for these categories were somewhat similar, similar, or very similar. The majority of the p-values for accuracy and performance score (with the exception of the thermal comfort analysis) showed significant differences compared with neurobehavioral speed. In addition, the majority of the qualitative ratings were somewhat different, different, or very different.
This study has analyzed the human productivity metrics used to measure the impact of IEQ in office buildings. The approach taken defines productivity as an efficiency (a ratio of output to input), which paved the way for investigating each of the productivity metric categories. Previous studies and meta-analyses lacked such an analysis, leaving it unclear whether productivity metrics can feasibly be combined in meta-analyses.
The studies that aligned with our research interest were compiled, and each study's data points, IEQ conditions, productivity metrics, and productivity impacts were extracted. The 106 productivity metrics were grouped into six categories, and the neurobehavioral speed category was set as the baseline given its fit to our definition of productivity. The other five categories (neurobehavioral response time, accuracy, call handling time, performance score, and self-reported productivity) were analyzed statistically. The neurobehavioral response time, self-reported productivity, and call handling time metrics were found to have statistical similarity to the baseline. Based on the results of this study, regression models from meta-analyses that combine performance score or accuracy metrics with other productivity categories should be scrutinized. In future studies, more care should be given to defining productivity and evaluating how studies fit this definition, so that the results of those studies are meaningful.
Data scarcity is a significant limitation of this research. Given the complex nature of this field, all the categories would benefit from more data points and from studies with larger participant sizes and longer durations. The authors note that this research is not conclusive and only suggestive of the true nature of productivity metrics, given the limited data that underlie the findings. Furthermore, this research would benefit from the inclusion of more IEQ parameters, such as volatile organic compounds, particulate matter, daylight, and acoustics, given their impacts on human productivity in buildings. This would provide a larger picture of the building-occupant nexus.
There are a number of confounding factors that could be controlled for within studies but not between studies. For example, one study could look at the effects of CO2 at 30 °C and another could run the experiment at 21 °C. The studies were conducted at various times of the year and in various climate zones, which can impact the indoor environment and human behavior. It is expected that, in general, the results will be similar in nature; for example, decreasing CO2 at a constant 30 °C will likely improve productivity if decreasing CO2 by the same amount improves productivity at 21 °C; however, the magnitude of this productivity improvement may be dampened or exacerbated due to the interaction of environmental parameters. Because these confounding factors are not reported consistently across studies, it is not possible to take them into consideration and still maintain a large sample size for the meta-analysis.
This research is applicable to productivity in general for office workers. In future research, we plan to explore how the diversity of work functions and types could be addressed to make this research more applicable to individual buildings. Different types of cognitive tasks may be more or less applicable to different types of office work. For example, a digit-span memory test may be a suitable metric for a worker who regularly exercises executive function skills but may not be applicable to a worker who relies mostly on innovation and creativity skills. In a future where there is uncertainty around return-to-work status and a potential norm of increased remote working, there is the possibility of applying research like this to remote workers to understand how the IEQ conditions of employees' homes could be creating an opportunity for employer-sponsored home improvements.
The research is funded by the U.S. Department of Energy (DOE) Federal Energy Management Program. The authors would like to thank the DOE project managers Jefferey Murrell and Allison Ackerman for sponsoring this research and their continued support throughout the project. Special thanks go to our General Services Administration collaborators Judith Heerwagen, PhD, and Brian Gilligan for sharing their technical expertise and domain knowledge. We would like to thank Erik Mets at Pacific Northwest National Laboratory for contributing to this research and providing feedback.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are available from the corresponding author upon reasonable request.