- Background
- The business problem I was supposed to solve
- Business model
- The Challenge
- First impressions of the task
- A step-by-step guide to reverse engineering (simple) psychometric tools
- I. Exploring the dimensions of the tool
- II. Reverse items
- III. Double-check the results
- Standardization in psychometrics
- Solution
- A perfect task for machines
- Model performance measurement
- Understand the business needs better
- Porters are the Google Translate for coders
- Understand people better
- Happy end
- What did I learn from this project?
Background
The business problem I was supposed to solve
In this project, an HR startup asked me to recreate their online assessment tool’s scoring system. Why? Because they had lost access to it. What they still had was test-server data from before the launch, covering thousands of possible answer combinations.
Business model
The business model of the client's HR platform was simple. There were two roles:
- Jobseekers
- Employers
Jobseekers fill in a survey (“a psychometric tool”) about soft skills and personality. (Almost) everybody loves personality tests, right? Then they search for better-fitting positions.
Employers fill in a traditional job requirement form (language, education, experience, etc.). Then they specify which kind of candidate they would like to hire, in terms of soft skills and career-relevant personality traits.
The Challenge
First impressions of the task
“The Client needs the scoring to compute the results from the input! They have a 9999+ row database with possible variations like this:”
# | Question1 | Q2 | Q3 | Q4 | result |
test001 | 2 (Rather not true) | 1 | 3 | 4 | 0.375 |
test002 | 4 (Always true) | 4 | 4 | 4 | 0.625 |
… | … | … | … | … | … |
test999 | 3 (Mostly true) | 4 | 1 | 2 | 0.875 |
“Easy-peasy” — I thought.
With a basic understanding of psychometrics, you can reverse engineer most psychometric tools (e.g. personality or attitude surveys). I estimated 2-4 hours for deciphering the scoring system manually in Google Sheets…
A step-by-step guide to reverse engineering (simple) psychometric tools
I. Exploring the dimensions of the tool
Let me show a simplified example. Here are jobseeker 001’s answers to 4 questions:
# | Question | jobseeker 001’s answer |
Q1 | I love attention. | 2 (Rather not true) |
Q2 | Usually, I'm the loudest. | 1 (Never true) |
Q3 | I’m shy. | 3 (Mostly true) |
Q4 | I hate smalltalk. | 4 (Always true) |
RESULT | | 0.375 |
First, you have to figure out that these questions measure the same psychometric construct (in this example: extroversion).
- Sometimes it's easy because of the chunked structure of the survey or the order of the questions.
- Sometimes you need a bit of domain knowledge to figure out that two questions try to measure something similar.
E.g.:
Q1 “I enjoy trying new foods” | A1: Always true (4) and Q2 “Try to avoid complex people.” | A2: Never true (1)
Both of these questions measure the so-called “Openness to experience” trait, one of the Big Five personality factors.
II. Reverse items
The second item (Q2) is reversed: agreeing with it indicates less openness, not more. This means (in terms of scoring) you shouldn’t just add the scores of the answers, like:
A1 + A2 = 4 + 1 = 5
Instead, use this simple formula:
A1 + ((N + 1) - A2) = 8
where N is the highest possible answer value, in our case 4 (Always true). So: 4 + ((4 + 1) - 1) = 8
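Here is a minimal Python sketch of the same reverse-scoring rule. The function names and the dictionary layout are made up for illustration; only the formula itself comes from the text above.

```python
def reverse_item(answer: int, n_max: int) -> int:
    """Flip an answer on a 1..n_max scale: 1 becomes n_max, n_max becomes 1."""
    if not 1 <= answer <= n_max:
        raise ValueError(f"answer {answer} is outside the 1..{n_max} scale")
    return (n_max + 1) - answer


def raw_score(answers: dict, reversed_items: set, n_max: int) -> int:
    """Sum the answers of one dimension, flipping the reversed items first."""
    total = 0
    for item, answer in answers.items():
        total += reverse_item(answer, n_max) if item in reversed_items else answer
    return total


# The two "Openness to experience" answers from the example above.
openness = {"Q1": 4, "Q2": 1}

naive_sum = sum(openness.values())              # ignores the reversed item
scored = raw_score(openness, {"Q2"}, n_max=4)   # treats Q2 as reversed

print(naive_sum, scored)  # 5 8
```

Which items are reversed still has to be decided by a human reading the questions; the code only applies that decision.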
So the second step should be super obvious by now:
- you have to find the reversed items if you want to recreate the scoring.
Yes, I know. It sounds very obvious, but believe me: even highly rated, otherwise excellent Kaggle projects miss this step due to a lack of domain knowledge in psychometrics!
A real example:
After this step, I thought I was done: I would send an Excel file with formulas, maybe with an explanation, and that's it. I didn't expect to lose my machine learning virginity.🙈
III. Double-check the results
After double-checking the results, I realized that the score doesn't depend on the raw answer values alone. Remember: the client had a huge set of test-server data with random inputs and their outputs. There, instead of this:
A1(4) + A2r(1) = 8
I got this:
A1(4) + A2r(1) = 0.86
A3(4) + A4r(1) = 0.79
A5(4) + A6r(1) = 0.94
Same inputs, different outputs! This means the raw results aren’t all transformed in one universal way. They are weighted, item by item. Now what?! At this point, the whole project was filled with fear and uncertainty.
Standardization in psychometrics
To understand why the scoring works like this instead of just adding raw answer values, let's go back to the earlier example, where we asked Alice 👧 these questions:
👧 | Question 1 | “I enjoy trying new foods” |
👧 | Answer 1 | “Always true (4)” |
👧 | Question 2 (reversed) | “Try to avoid complex people.” |
👧 | Answer 2: | “Never true (1)” |
These questions are supposed to measure Openness to experience.
But...what if we use different questions? Let's pretend we ask Bob👨🎤 about these two:
👨🎤 | Question 1 | “I enjoy trying new foods” |
👨🎤 | Answer 1 | “Always true (4)” |
👨🎤 | Question 3 | “I took every opportunity to try new drugs, even harder ones.” |
👨🎤 | Answer 3: | “Always true (4)” |
So, if we look at the raw scores, they are identical: Alice gets 4 + ((4 + 1) - 1) = 8 and Bob gets 4 + 4 = 8.
Both item pairs got the maximum points. However, we can suspect that these two individuals have very different personalities in terms of “Openness to experience”.
Before a psychometric tool is released, its questions are compared with each other. Let's imagine that we ask these questions of every adult on Earth and get two histograms from the results:
These results mean that the answer “never true (1)” is rarer in the population for Q1 than for Q3.
The conclusion is: 1 ≠ 1
(🤯 Boom!)
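To make "1 ≠ 1" concrete, here is a small sketch that turns a raw answer into a percentile rank within the population. The answer distributions are invented for illustration, and percentile ranking is just one common way to standardize; I don't know which method the original tool used.

```python
# Hypothetical share of the population giving each answer (1..4).
# These numbers are invented; they only mimic the imagined histograms above.
POPULATION = {
    "Q1_new_foods": {1: 0.05, 2: 0.15, 3: 0.40, 4: 0.40},
    "Q3_new_drugs": {1: 0.70, 2: 0.20, 3: 0.08, 4: 0.02},
}


def percentile_rank(item: str, answer: int) -> float:
    """Share of people answering lower, plus half of those answering the same."""
    distribution = POPULATION[item]
    below = sum(share for value, share in distribution.items() if value < answer)
    return below + distribution[answer] / 2


# The same raw answer lands in very different places, depending on the item.
print(round(percentile_rank("Q1_new_foods", 4), 2))  # about 0.80
print(round(percentile_rank("Q3_new_drugs", 4), 2))  # about 0.99
```

With these made-up numbers, "Always true (4)" on the food question is fairly common, while the same answer on the drug question puts Bob at the very top of the population.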
Solution
A perfect task for machines
Once I realized that I could not solve this task with Google Sheets and the manual method, I slowly started to panic. Even if I accidentally found the formula for one subscale, there were a dozen more dimensions, each with different scoring. I had underestimated the task, and now I was facing a shameful failure and going back to delivering food on my bike…
OR
I can build a prediction model with machine learning! …Which I had never ever done before! 😂
Why did I take the risk? Well, even though I had never built a machine learning model before, I was interested in the topic and understood the very basics of the technology. Also:
- I have a lot of structured data, which seems good for training and easy to prepare.
- I don't need to understand the exact scoring, just replicate the scoring.
- I don’t need perfect predictions, just almost perfect ones, because the scoring system uses a small number of categories for the job seekers.
So, I took a deep breath and built a very underoptimized prototype with SPSS, basically with the default settings. Here is the syntax:
USE ALL.
COMPUTE filter_$=(name="Sofskill_1").
VARIABLE LABELS filter_$ 'name="Sofskill_1" (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
*Multilayer Perceptron Network.
MLP result (MLEVEL=S) BY answer_num_1 answer_num_2 answer_num_4 answer_num_3 answer_num_5 answer_num_6 answer_num_7 answer_num_8 answer_num_9 answer_num_10
/PARTITION TRAINING=7 TESTING=3 HOLDOUT=0
/ARCHITECTURE AUTOMATIC=YES (MINUNITS=1 MAXUNITS=50)
/CRITERIA TRAINING=BATCH OPTIMIZATION=SCALEDCONJUGATE LAMBDAINITIAL=0.0000005 SIGMAINITIAL=0.00005 INTERVALCENTER=0 INTERVALOFFSET=0.5 MEMSIZE=1000
/PRINT CPS NETWORKINFO SUMMARY CLASSIFICATION
/PLOT NETWORK
/SAVE PREDVAL
/STOPPINGRULES ERRORSTEPS= 1 (DATA=AUTO) TRAININGTIMER=ON (MAXTIME=15) MAXEPOCHS=AUTO ERRORCHANGE=1.0E-4 ERRORRATIO=0.001
/MISSING USERMISSING=EXCLUDE .
And here are the results:
On the first try, I had 89%+ accuracy! 😂
Then I called the best data scientist friend I know (Viktor Toth ❤ ) for a quick consultation to help me build the same, very basic perceptron in Python, only for one soft-skill dataset.
I hoped that I could figure out the rest by myself because I had read the first 2 chapters of “Automate the Boring Stuff with Python.”
- Writing example code for one soft-skill dataset took Viktor <30 min.
- After this, it took me days to figure out how the basics of the code work, and to clean & slice the data properly for all (20+) separate soft-skill datasets.
Then after a couple of days, I finally succeeded. OK, now what? How do I measure my model’s performance?
Model performance measurement
At first, I intuitively used the absolute difference between the “predicted value” and the “result” from the API test. After researching a bit about the very basics of ML model error measurement, I switched to R².
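# now evaluate the model with R^2
# (assumes numpy is imported as np and r2 is a list created before the loop)
Xtest = np.array(test_data[[f'answer_num_{n}' for n in range(1, 16)]])
ytest = np.array(test_data['result'])
score = model.score(Xtest, ytest)  # for scikit-learn regressors, score() returns R^2
print('R^2 score of the trained model ' + name_filter + ' : ', score)
r2.append([score, name_filter])  # collect one [score, name] pair per soft skill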
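For comparison, both measures mentioned above can be written out by hand in a few lines. This is only a sketch with made-up numbers, not the project's real data; the function names are my own.

```python
def mean_absolute_error(actual, predicted):
    """Average of |result - predicted value| over all rows."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (unexplained variation / total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up "result" values and predictions, just to exercise the functions.
results = [0.2, 0.4, 0.6, 0.8]
predictions = [0.25, 0.35, 0.65, 0.75]

print(round(mean_absolute_error(results, predictions), 3))  # 0.05
print(round(r_squared(results, predictions), 3))            # 0.95
```

The absolute difference tells you how far off a typical prediction is, in score units. R² tells you how much of the variation in the results the model explains, which makes models for different soft skills easier to compare.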
Then I tweaked the model parameters with a primitive but effective approach: I made a loop to train & evaluate again and again, changing the perceptron’s parameters a bit in every iteration.
I have to admit that I had only superficial knowledge about hidden layers or layer size, but the model improved by a few %, so… ¯\_ (ツ)_/¯.
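A train-and-evaluate loop of this kind could look roughly like the sketch below. It uses scikit-learn's MLPRegressor on synthetic data. The data, the made-up scoring rule, and the candidate layer sizes are all assumptions for illustration, not the original project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for one soft-skill dataset: 10 answers (1-4) per row.
X = rng.integers(1, 5, size=(600, 10))
# Made-up scoring rule (weighted mean scaled to 0..1); NOT the client's scoring.
weights = rng.uniform(0.5, 1.5, size=10)
y = ((X - 1) @ weights) / (3 * weights.sum())

# 70/30 split, like the SPSS prototype's training/testing partition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical hidden-layer layouts to try, one model per layout.
candidate_sizes = [(5,), (20,), (20, 10)]
scores = {}
for size in candidate_sizes:
    model = MLPRegressor(
        hidden_layer_sizes=size, solver="lbfgs", max_iter=1000, random_state=0
    )
    model.fit(X_train, y_train)
    scores[size] = model.score(X_test, y_test)  # R^2 on the held-out rows

best_size = max(scores, key=scores.get)
print(best_size, round(scores[best_size], 3))
```

Picking the winner by its score on the same test rows is exactly the "primitive" part: with many iterations, a separate validation set would give a more honest estimate.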
…so that’s it!
I will send the prediction part of the code, well commented.
And I also include all the .pickle files, trained for each soft skill.
Then the client implements this code on their own servers instead of the old API calls, and TADA. Right?
Understand the business needs better
Long story short: The client wasn’t familiar with backend-side Python implementation.
Neither was I.
There was an epic call with the CTO. Top quotes:
“I will text to the sysadmin, if there any opportunity to install the piton to the server…”
“(…) its not a vain that everybody dislikes piton”
“Let's suppose it works. But after, you should help me with this piton stuff .. handle the data to the PHP!”
A̵t̵ ̵t̵h̵i̵s̵ ̵p̵o̵i̵n̵t̵,̵ ̵I̵ ̵w̵a̵s̵ ̵s̵c̵r̵e̵a̵m̵i̵n̵g̵ ̵i̵n̵s̵i̵d̵e̵:̵ ̵̵
”̵I̵’̵m̵ ̵j̵u̵s̵t̵ ̵a̵ ̵n̵e̵r̵d̵y̵ ̵p̵s̵y̵c̵h̵o̵l̵o̵g̵i̵s̵t̵,̵ ̵h̵i̵r̵e̵d̵ ̵f̵o̵r̵ ̵d̵e̵c̵y̵p̵h̵e̵r̵ ̵a̵ ̵s̵o̵f̵t̵ ̵s̵k̵i̵l̵l̵ ̵t̵e̵s̵t̵’̵s̵ ̵s̵c̵o̵r̵i̵n̵g̵. N̵o̵t̵ ̵a̵ ̵d̵e̵v̵e̵l̵o̵p̵e̵r̵!̵ P̵l̵e̵a̵s̵e̵,̵ ̵g̵i̵v̵e̵ ̵m̵e̵ ̵a̵ ̵b̵r̵a̵k̵e̵!̵😭̵”
…*clearing throats*…
At this point, I realized:
To finish this project, I had to adapt to the client’s tech stack and better understand the client's business problem and motives.
After some quick research, I made a video presentation about the model and two possible solution briefs (A and B) for a potential backend consultant. My goal was to validate this plan with somebody who has “developer” in their job title and is familiar with Python & PHP servers.
PLAN A: The client would have a separate Flask / Django server replacing the old API. At the end of the day, they would own the whole scoring algorithm. Also, they had used an API for the scoring (until they lost access to it), so this solution seemed smoother in terms of implementation.
PLAN B: The client would have a JS file for every model [yaaayks] embedded in the website, which would be visible to everyone, but even easier to implement.
To my great surprise, they chose PLAN B.
(After a little digging, it turned out that the startup was on its last legs and starting to run out of resources. I guess that was the last chance before investors would turn off the tap. Now choosing Plan B is much more understandable.)
Porters are the Google Translate for coders
The next step was a quick consultation with Viktor, who mentioned the word “porter”.
(Porters are libraries that… translate code from one language to another.)
So I dug deep into GitHub. Using this scikit-learn porter, I ended up with a bunch of ugly, big, and fat JS files instead of .pickle files. Huuuh.
Now I can send those JS files to the client, and they implement them. We are finished, right?
Understand people better
I was about to celebrate. But then a long and coldhearted email arrived, reporting error rates above 20–30%, in some cases even worse. This was funny because my R² scores were between 96–98%.
test | test sample # | MAX error | AVG error | exact match | category errors # | category errors % |
my test 😊 | 3465 | 0.46 | 0.059 | 1665 | 80 | 2.4% |
client’s test 😱 | 306 | 1.47 | 0.57 | 4 | 268 | 87% |
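The columns of a table like this could be computed along the following lines. This is a sketch: the category cut points are hypothetical (the real category boundaries are not given here), and the sample numbers are invented.

```python
from bisect import bisect_right

# Hypothetical cut points splitting a 0..1 score into 4 categories.
CUT_POINTS = [0.25, 0.5, 0.75]


def to_category(score: float) -> int:
    """Index of the category a score falls into (0 = lowest)."""
    return bisect_right(CUT_POINTS, score)


def summarize(actual, predicted):
    """Error summary in the spirit of the table above."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    category_errors = sum(
        to_category(a) != to_category(p) for a, p in zip(actual, predicted)
    )
    return {
        "sample_size": len(actual),
        "max_error": max(errors),
        "avg_error": sum(errors) / len(errors),
        "category_errors": category_errors,
        "category_error_rate": category_errors / len(actual),
    }


# Invented scores: only the third pair crosses a category boundary (0.5).
actual = [0.10, 0.30, 0.49, 0.80]
predicted = [0.12, 0.28, 0.52, 0.79]

report = summarize(actual, predicted)
print(report["category_errors"], report["category_error_rate"])  # 1 0.25
```

Note how a tiny numeric error (0.03) still counts as a category error when the true score sits right next to a boundary. That is why the category error rate matters more to the client than the raw error.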
I was shocked at first: Let's say I had a light-yellow belt in Python, but almost zero experience with JavaScript.
- So, although testing didn't seem like a complicated task, I wasn’t able to test all models one by one, generate outputs, and measure error rates in JS…
- …and I had 1.5 days left until the deadline.
- This meant I wasn’t sure that the porting process from Python to JS was successful.
- (Also, I was scared: what if the model really is wrong because I overfitted it?!)
After searching for the nonexistent error for days (and abusing Viktor’s time and attention), I found out that nothing was wrong with my code, not even with the JS. Their test sample was simply very small compared to mine (306 vs 3465).
I asked them for another test with a bigger sample, because it's almost impossible that those error rates are valid (assuming the porting process was successful)… and the response was:
“it takes days to perform the test with that sample size”.
At this point, I became suspicious (maybe a bit late). My Win 10 tablet trains a model on thousands of data points in 8 minutes, yet their testing takes days. Very interesting.🤔
Remember the CTO who asked for an NLP model in PHP? He somehow managed to fill the test set with outliers, which were a weak spot of my model, e.g.:
[1,1,1,1,1,1] [1,1,1,1,1,2] [1,1,1,1,1,3] … [6,6,6,6,6,6] [6,6,6,6,6,5] [6,6,6,6,6,4]
For some reason, the neural net wasn’t good at this kind of extreme case, and for some reason, exactly these cases were selected for the test set to evaluate my models. Very-very interesting.🤔
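One simple way to check whether a test set is stacked with such extreme rows is to measure their share. The sketch below is hypothetical: the helper name, the 1-6 scale, and the "at most one deviating answer" threshold are my assumptions, based only on the example rows above.

```python
def is_extreme_pattern(answers, low=1, high=6, max_deviating=1):
    """True if all answers but at most `max_deviating` sit on the same end of the scale."""
    at_low = sum(a == low for a in answers)
    at_high = sum(a == high for a in answers)
    return max(at_low, at_high) >= len(answers) - max_deviating


# A tiny invented test set: two extreme rows and one ordinary row.
test_rows = [
    [1, 1, 1, 1, 1, 1],
    [6, 6, 6, 6, 6, 4],
    [3, 5, 2, 4, 1, 6],
]

extreme_share = sum(is_extreme_pattern(row) for row in test_rows) / len(test_rows)
print(round(extreme_share, 2))  # 0.67
```

If the share of extreme rows in the test set is far higher than in the training data, the reported error rate says more about the sampling than about the model.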
So I wrote a step-by-step deduction about these “anomalies”, in a very careful and easily digestible way. With diagrams, without pointing fingers.
I sent this to the CEO. One day later the deadline was extended, and the test they ran on the bigger dataset showed 96% results. Tadaaa!
Happy end
The client was very satisfied. So was I, because this was my first coding-heavy project and the first time I used machine learning to solve a real business problem.
But the best part was that I “got back” my professional self-confidence (although I really enjoyed delivering food by bicycle for 3 days). This experience put me back on my original life track.
(Viktor didn’t want to accept any money, so I purchased a big chunk of his Steam wishlist.)
What did I learn from this project?
- coding needs empathy
- coding is addictive
- solving problems you never solved is addictive
- asking for money to solve problems you never solved…is not fully responsible, but extremely fun
- technical skills & stack < domain & business knowledge < good communication
- but! technical skills are the basis of it all: without them, you can't even imagine what is possible business-wise, or communicate what your needs are