# Time to completion

A common problem is estimating how long a project we are working on will take to complete. Our estimates can be very inaccurate, especially when they are based on gut feeling. They improve with experience, but we still can't reliably anticipate every way in which a new project will differ from the ones we have done before. Subdividing the project into tasks and estimating each one individually helps too, but many factors beyond the individual tasks affect the project's duration. The importance of these factors differs from project to project, to the point that no two projects are alike.

What we are really trying to do is **predict** something we don't know. The word in bold hints at a method we could use: linear or non-linear regression. If we learn that, on average, adding 1 m² to a house raises its price by n dollars, and that this holds every single time, then we can predict that adding 20 m² would raise the price by 20*n dollars. This is linear regression in use, but it assumes a linear relationship between the variables (area and price). The statistical language R, for instance, fits such a model with lm, and its abline function can draw the fitted regression line over a plot of the data points.
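The house-price reasoning above can be sketched in a few lines of plain Python. This is only an illustration: the `fit_line` helper is a hypothetical name, and the areas and prices are made-up numbers chosen so that price is exactly proportional to area. They are not data from this post.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: areas in square metres and prices in dollars,
# constructed so that price = 2000 * area.
areas = [50, 70, 90, 120]
prices = [100_000, 140_000, 180_000, 240_000]

slope, intercept = fit_line(areas, prices)

# The slope plays the role of n in the text: the price added per extra m².
extra_price_for_20_sqm = 20 * slope
print(f"n = {slope:.2f} dollars per m²")
print(f"adding 20 m² adds about {extra_price_for_20_sqm:.2f} dollars")
```

With real data the points would not lie exactly on a line, and the slope would only describe the average trend.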

We often don't have a linear relationship, but still want to examine the data. Many factors can determine the project duration, and they may be related in non-obvious ways. In our case we may have identified the following important features that influence the outcome: requirements gathering, idea generation and selection, task assignment, design, programming, testing, adaptation, client responsiveness, communication transparency, team size, project price and number of competing projects. It would be unwise to assume a linear relationship between any two of these features.

This blog post demonstrates a method we could use when the relationships are non-linear or otherwise more complex. To predict on new, unseen data, we first need to have collected data about our past projects (preferably at least 30). The following table records, for each past project, the importance of each criterion, considering both our perspective and the perspective of the client, together with the team size, the project price and the number of competing projects. The last column, "Time to completion (days)", serves as the label: it marks the outcome for that project given the numeric values in the preceding columns.

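| Requirements gathering | Idea generation and selection | Task assignment | Design | Programming | Testing | Adaptation | Client responsiveness | Communication transparency | Team size | Project price | Competing projects | Time to completion (days) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 64 | 50 | 80 | 80 | 65 | 75 | 44 | 55 | 4 | 850 | 8 | 18.0 |
| 84 | 72 | 81 | 70 | 67 | 73 | 73 | 62 | 72 | 3 | 680 | 7 | 15.0 |
| 71 | 58 | 71 | 91 | 89 | 58 | 80 | 55 | 35 | 4 | 970 | 4 | 14.0 |
| 66 | 84 | 76 | 72 | 81 | 68 | 84 | 66 | 51 | 5 | 1200 | 4 | 15.0 |
| 92 | 68 | 74 | 84 | 76 | 84 | 29 | 70 | 60 | 4 | 1020 | 6 | 21.0 |
| 64 | 84 | 68 | 74 | 87 | 85 | 78 | 52 | 80 | 3 | 840 | 5 | 18.0 |
| 84 | 56 | 87 | 61 | 91 | 81 | 71 | 67 | 65 | 4 | 1010 | 18 | 24.0 |
| 71 | 58 | 77 | 75 | 75 | 70 | 58 | 66 | 74 | 5 | 920 | 13 | 16.0 |
| 74 | 86 | 74 | 81 | 84 | 80 | 80 | 72 | 75 | 4 | 860 | 7 | 14.5 |
| 81 | 74 | 81 | 90 | 71 | 75 | 85 | 70 | 65 | 5 | 640 | 6 | 7.5 |
| 80 | 80 | 41 | 84 | 81 | 67 | 68 | 73 | 79 | 4 | 890 | 7 | 10.0 |
| 82 | 78 | 52 | 78 | 84 | 75 | 45 | 75 | 67 | 4 | 750 | 12 | 12.5 |
| 71 | 84 | 74 | 81 | 58 | 62 | 74 | 72 | 54 | 5 | 910 | 10 | 14.0 |
| 67 | 81 | 87 | 65 | 92 | 89 | 84 | 77 | 69 | 4 | 970 | 8 | 17.0 |
| 81 | 89 | 64 | 76 | 73 | 70 | 80 | 71 | 82 | 3 | 820 | 6 | 12.0 |
| 90 | 46 | 64 | 80 | 93 | 87 | 82 | 80 | 80 | 4 | 940 | 6 | 16.0 |
| 74 | 84 | 76 | 86 | 86 | 78 | 73 | 65 | 74 | 3 | 780 | 9 | 15.0 |
| 94 | 87 | 82 | 84 | 91 | 87 | 80 | 67 | 71 | 4 | 820 | 7 | 20.0 |
| 77 | 66 | 58 | 81 | 87 | 68 | 81 | 70 | 62 | 5 | 930 | 10 | 18.0 |
| 70 | 74 | 68 | 79 | 72 | 74 | 78 | 60 | 64 | 4 | 820 | 14 | 11.5 |
| 73 | 72 | 84 | 84 | 80 | 87 | 89 | 67 | 61 | 4 | 780 | 9 | 13.5 |
| 78 | 81 | 86 | 84 | 90 | 84 | 87 | 75 | 72 | 4 | 840 | 11 | 17.0 |
| 82 | 84 | 70 | 81 | 91 | 83 | 73 | 71 | 73 | 3 | 780 | 10 | 14.0 |
| 84 | 80 | 80 | 87 | 83 | 76 | 77 | 73 | 76 | 4 | 1070 | 9 | 17.0 |
| 72 | 87 | 68 | 82 | 87 | 80 | 81 | 68 | 70 | 4 | 890 | 7 | 15.5 |
| 68 | 74 | 71 | 79 | 84 | 83 | 86 | 80 | 65 | 4 | 940 | 14 | 16.5 |
| 64 | 75 | 70 | 70 | 80 | 74 | 76 | 74 | 72 | 5 | 920 | 8 | 16.5 |
| 74 | 72 | 77 | 82 | 93 | 84 | 89 | 77 | 73 | 5 | 1110 | 11 | 23.0 |
| 76 | 76 | 64 | 83 | 91 | 90 | 82 | 74 | 80 | 5 | 970 | 10 | 18.0 |
| 81 | 87 | 62 | 92 | 77 | 90 | 84 | 72 | 76 | 4 | 980 | 11 | 19.0 |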

Suppose that we want to estimate the duration of the following three projects:

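| Requirements gathering | Idea generation and selection | Task assignment | Design | Programming | Testing | Adaptation | Client responsiveness | Communication transparency | Team size | Project price | Competing projects | Time to completion (days) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77 | 56 | 65 | 74 | 80 | 74 | 72 | 65 | 71 | 4 | 840 | 9 | ? |
| 81 | 87 | 71 | 79 | 85 | 77 | 64 | 58 | 61 | 4 | 910 | 13 | ? |
| 83 | 84 | 62 | 80 | 74 | 68 | 79 | 62 | 65 | 4 | 890 | 11 | ? |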

We could fit an ExtraTreesRegressor (available in scikit-learn) on the data from the first table and then predict on the data from the second table. This gives estimated durations of 15.58, 14.985 and 13.725 days, respectively. We can also inspect the feature importances. In our case the most important features, ordered from most to least important, are project price, programming, design and number of competing projects. They carry the most weight in determining the project duration.
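The fit-and-predict step could look roughly like the sketch below, which uses scikit-learn's `ExtraTreesRegressor` on the rows of the two tables. The hyperparameters (`n_estimators=100`, `random_state=0`) are assumptions, since the post does not state which settings were used, so the predictions and the importance ranking will not necessarily match the figures quoted above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

FEATURES = [
    "Requirements gathering", "Idea generation and selection",
    "Task assignment", "Design", "Programming", "Testing", "Adaptation",
    "Client responsiveness", "Communication transparency", "Team size",
    "Project price", "Competing projects",
]

# Past projects: the 12 feature columns followed by the label (days).
past = np.array([
    [75, 64, 50, 80, 80, 65, 75, 44, 55, 4, 850, 8, 18.0],
    [84, 72, 81, 70, 67, 73, 73, 62, 72, 3, 680, 7, 15.0],
    [71, 58, 71, 91, 89, 58, 80, 55, 35, 4, 970, 4, 14.0],
    [66, 84, 76, 72, 81, 68, 84, 66, 51, 5, 1200, 4, 15.0],
    [92, 68, 74, 84, 76, 84, 29, 70, 60, 4, 1020, 6, 21.0],
    [64, 84, 68, 74, 87, 85, 78, 52, 80, 3, 840, 5, 18.0],
    [84, 56, 87, 61, 91, 81, 71, 67, 65, 4, 1010, 18, 24.0],
    [71, 58, 77, 75, 75, 70, 58, 66, 74, 5, 920, 13, 16.0],
    [74, 86, 74, 81, 84, 80, 80, 72, 75, 4, 860, 7, 14.5],
    [81, 74, 81, 90, 71, 75, 85, 70, 65, 5, 640, 6, 7.5],
    [80, 80, 41, 84, 81, 67, 68, 73, 79, 4, 890, 7, 10.0],
    [82, 78, 52, 78, 84, 75, 45, 75, 67, 4, 750, 12, 12.5],
    [71, 84, 74, 81, 58, 62, 74, 72, 54, 5, 910, 10, 14.0],
    [67, 81, 87, 65, 92, 89, 84, 77, 69, 4, 970, 8, 17.0],
    [81, 89, 64, 76, 73, 70, 80, 71, 82, 3, 820, 6, 12.0],
    [90, 46, 64, 80, 93, 87, 82, 80, 80, 4, 940, 6, 16.0],
    [74, 84, 76, 86, 86, 78, 73, 65, 74, 3, 780, 9, 15.0],
    [94, 87, 82, 84, 91, 87, 80, 67, 71, 4, 820, 7, 20.0],
    [77, 66, 58, 81, 87, 68, 81, 70, 62, 5, 930, 10, 18.0],
    [70, 74, 68, 79, 72, 74, 78, 60, 64, 4, 820, 14, 11.5],
    [73, 72, 84, 84, 80, 87, 89, 67, 61, 4, 780, 9, 13.5],
    [78, 81, 86, 84, 90, 84, 87, 75, 72, 4, 840, 11, 17.0],
    [82, 84, 70, 81, 91, 83, 73, 71, 73, 3, 780, 10, 14.0],
    [84, 80, 80, 87, 83, 76, 77, 73, 76, 4, 1070, 9, 17.0],
    [72, 87, 68, 82, 87, 80, 81, 68, 70, 4, 890, 7, 15.5],
    [68, 74, 71, 79, 84, 83, 86, 80, 65, 4, 940, 14, 16.5],
    [64, 75, 70, 70, 80, 74, 76, 74, 72, 5, 920, 8, 16.5],
    [74, 72, 77, 82, 93, 84, 89, 77, 73, 5, 1110, 11, 23.0],
    [76, 76, 64, 83, 91, 90, 82, 74, 80, 5, 970, 10, 18.0],
    [81, 87, 62, 92, 77, 90, 84, 72, 76, 4, 980, 11, 19.0],
])

# New projects: the same 12 feature columns, duration unknown.
new = np.array([
    [77, 56, 65, 74, 80, 74, 72, 65, 71, 4, 840, 9],
    [81, 87, 71, 79, 85, 77, 64, 58, 61, 4, 910, 13],
    [83, 84, 62, 80, 74, 68, 79, 62, 65, 4, 890, 11],
])

X, y = past[:, :-1], past[:, -1]

# Assumed settings; the post does not say which ones were used.
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

predicted_days = model.predict(new)
print("Predicted durations (days):", predicted_days)

# Rank the features by the importance the fitted model assigns to them.
ranked = sorted(zip(FEATURES, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

With only 30 training rows, the ranking can shift noticeably between random seeds, so it is worth rerunning with a few different `random_state` values before trusting the order.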

If we also want to check which of these projects would be most beneficial to work on, this takes a single line of code. Dividing each project's price by its team size, and the result by the estimated time to completion, shows that each person on the team would earn on average 13.478819, 15.14848182 and 16.17486339 {currency unit} per day of work. This makes the last project the most attractive choice.
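A minimal sketch of that calculation, taking the team size (4 for every project) and the prices (840, 910 and 890) from the second table, together with the estimated durations quoted above. The variable names are hypothetical. Note that the post's figures for the second and third projects appear to correspond to whole-number division of price by team size (227 and 222), while the true division used here gives slightly higher values and the same ranking.

```python
# Values read from the second table plus the estimated durations in the post.
projects = [
    {"team_size": 4, "price": 840, "days": 15.58},
    {"team_size": 4, "price": 910, "days": 14.985},
    {"team_size": 4, "price": 890, "days": 13.725},
]

# Earnings per person per day: price / team size / estimated duration.
per_person_per_day = [
    p["price"] / p["team_size"] / p["days"] for p in projects
]

# Index of the project with the highest earnings per person per day.
best = max(range(len(projects)), key=lambda i: per_person_per_day[i])

for i, value in enumerate(per_person_per_day, start=1):
    print(f"Project {i}: {value:.2f} per person per day")
print(f"Most beneficial: project {best + 1}")
```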