🕸️

Reverse engineering a psychometric scoring system with simple neural networks

☝

Disclaimer: This piece is a detailed case study (15+ minutes to read), a guide, and a personal story at the same time, about a very intense project in 2019. I wrote it at this level of detail to give you a chance to fully understand my thinking process. Feel free to navigate using the table of contents:

Background
The business problem I was supposed to solve
Business model
The Challenge
First impressions of the task
A step-by-step guide to reverse engineering (simple) psychometric tools
I. Exploring the dimensions of the tool
II. Reverse items
III. Double-check the results
Standardization in psychometrics
Solution
A perfect task for machines
Model performance measurement
Understand the business needs better
Porters are the Google Translate for coders
Understand people better
Happy end
What did I learn from this project?

Background

The business problem I was supposed to solve

In this project, an HR startup asked me to recreate the scoring system of their online assessment tool. Why? Because they had lost access to it. What they did have was test server data, collected before the launch, covering thousands of possible answer combinations.

Business model

The business model of the client's HR platform was simple. There were two roles:

Jobseekers
Employers

(UI example & questions are just illustrations. Hopefully, nobody will try to assess candidates with this kind of question.)

Jobseekers fill out a survey (“a psychometric tool”) about soft skills and personality. (Almost) everybody loves personality tests, right? Then they search for better-fitting positions.

Employers fill out a traditional job requirement form (language, education, experience, etc.). Then they specify which kind of candidate they would like to hire, in terms of soft skills and career-relevant personality traits.

The Challenge

First impressions of the task

“The Client needs the scoring to compute the results from the input! They have a 9999+ row database with possible variations like this:”

#	Question1	Q2	Q3	Q4	result
test001	2 (Rather not true)	1	3	4	0.375
test002	4 (Always true)	4	4	4	0.625
…	…	…	…	…	…
test999	3 (Mostly true)	4	1	2	0.875

“Easy-peasy” — I thought.

With a basic understanding of psychometrics, you can reverse engineer most psychometric tools (e.g. personality or attitude surveys). I estimated 2-4 hours for deciphering the scoring system manually in Google Sheets…

A step-by-step guide to reverse engineering (simple) psychometric tools

I. Exploring the dimensions of the tool

Let me show a simplified example. Here are jobseeker 001’s answers to 4 questions:

Questions		jobseeker 001
Q1	I love attention.	2 (Rather not true)
Q2	Usually, I'm the loudest.	1 (Never true)
Q3	I’m shy.	3 (Mostly true)
Q4	I hate smalltalk.	4 (Always true)
RESULT		0.375

First, you have to figure out that these questions are measuring the same psychometric construct. (In this example: Extroversion.)

Sometimes it's easy, because of the chunked structure of the survey or the order of the questions.
Sometimes you need a bit of domain knowledge to figure out that two questions try to measure something similar.

E.g.:

Q1 “I enjoy trying new foods” | A1: Always true (4) and Q2 “Try to avoid complex people.” | A2: Never true (1)

Answers to both these questions measure the so-called “Openness to experience” trait of the Big Five personality factors.

II. Reverse items

The second item (Q2) is reversed. This means (in terms of scoring) you shouldn’t just add the scores of the answers, like:

A1 + A2 = 4 + 1 = 5

Instead, use this simple formula:

A1 + ((N + 1) - A2) = 8, where N is the highest possible answer value. In our case that is Always true (4), so: 4 + ((4 + 1) - 1) = 8
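The formula above can be written as a tiny helper. This is only a sketch: the function names and the dictionary layout are made up for illustration, and it assumes the scale starts at 1, as in the examples here.

```python
def reverse_score(answer: int, n_max: int = 4, n_min: int = 1) -> int:
    """Mirror an answer on the scale: on a 1..4 scale, 1 becomes 4 and 4 becomes 1."""
    return (n_max + n_min) - answer


def raw_scale_score(answers: dict, reversed_items: set, n_max: int = 4) -> int:
    """Sum the answers of one dimension, reversing the items listed in reversed_items."""
    total = 0
    for item, answer in answers.items():
        total += reverse_score(answer, n_max) if item in reversed_items else answer
    return total


# The Openness example from the text: A1 = 4, A2 = 1, and Q2 is the reversed item.
openness = raw_scale_score({"Q1": 4, "Q2": 1}, reversed_items={"Q2"})
print(openness)  # 8
```

Without marking Q2 as reversed, the same call would return the misleading 5.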

So the second step should be super obvious by now:

- you have to find the reversed items if you want to recreate the scoring.

Yes, I know. It sounds very obvious, but believe me: even highly rated, otherwise excellent Kaggle projects miss this step, due to a lack of domain knowledge in psychometrics!

A real example:

“would make more sense” is a very humble way to say that the whole clustering is meaningless without the correction. Source: Kaggle

After this step, I thought I was done. I would send an Excel file with formulas, maybe with an explanation, and that's it. I didn't expect to lose my machine learning virginity.🙈

III. Double-check the results

After double-checking the results, I realized that the scoring doesn't depend on the raw answer values alone. Remember: the client had a huge set of test server data with random inputs and their outputs. There, instead of this:

A1(4) + A2r(1) = 8

I got this:

A1(4) + A2r(1) = 0.86
A3(4) + A4r(1) = 0.79
A5(4) + A6r(1) = 0.94

Same inputs, different outputs! This means the raw results aren’t just transformed in one universal way. They are weighted. Now what?! At this point, the whole project was filled with fear and uncertainty.
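This kind of double-check can be automated: group the test rows by their reverse-corrected raw sum and see whether one raw sum ever maps to more than one result. The sketch below uses made-up rows (not the client's data) and a hypothetical helper name; if any raw sum shows several results, a plain sum cannot be the whole scoring rule.

```python
from collections import defaultdict


def results_by_raw_sum(rows, items, reversed_items, n_max=4):
    """Group the observed results by the reverse-corrected raw sum of each row."""
    groups = defaultdict(set)
    for row in rows:
        raw = sum((n_max + 1 - row[i]) if i in reversed_items else row[i] for i in items)
        groups[raw].add(row["result"])
    return dict(groups)


# Made-up rows for illustration only.
rows = [
    {"Q1": 4, "Q2": 1, "result": 0.86},
    {"Q1": 4, "Q2": 2, "result": 0.71},
    {"Q1": 3, "Q2": 1, "result": 0.66},
]

groups = results_by_raw_sum(rows, items=["Q1", "Q2"], reversed_items={"Q2"})

# Raw sums that map to more than one result: a sign of item weighting.
ambiguous = sorted(raw for raw, results in groups.items() if len(results) > 1)
print(ambiguous)  # [7]
```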

Standardization in psychometrics

To understand why the scoring works like this, instead of just adding raw answer values, let's go back to the earlier example, where we asked Alice👧 these questions:

👧	Question 1	“I enjoy trying new foods”
👧	Answer 1	“Always true (4)”

👧	Question 2 (reversed)	“Try to avoid complex people.”
👧	Answer 2	“Never true (1)”

These questions are supposed to measure Openness to experience.

But...what if we use different questions? Let's pretend we ask Bob👨‍🎤 about these two:

👨‍🎤	Question 1	“I enjoy trying new foods”
👨‍🎤	Answer 1	“Always true (4)”

👨‍🎤	Question 3	“I took every opportunity to try new drugs, even harder ones.”
👨‍🎤	Answer 3	“Always true (4)”

So, if we look at the raw scores, they are identical:

❕

👧A1+A2r=8 & 👨‍🎤 A1+A3=8

Both item pairs got the maximum points. However, we can suspect that these two individuals have very different personalities in terms of “Openness to experience”.

Before a psychometric tool is created, its questions are compared. Let's imagine that we ask every adult on Earth these questions, and we get these two histograms from the results:

This result means that the answer “never true (1)” to Q1 is rarer in the population than the answer “never true (1)” to Q3.

The conclusion is: 1 ≠ 1

(🤯 Boom!)
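To make “1 ≠ 1” concrete, here is a minimal sketch of one common standardization, the z-score, which expresses an answer relative to how the population answered that item. The two answer distributions below are invented for illustration, and nothing in the project confirms that the client's tool used exactly this transformation.

```python
from statistics import mean, pstdev


def z_score(answer, population):
    """Distance of an answer from the population mean, in standard deviations."""
    return (answer - mean(population)) / pstdev(population)


# Hypothetical answers of 1000 people per item, on the 1..4 scale.
q1_population = [1] * 50 + [2] * 150 + [3] * 400 + [4] * 400   # "I enjoy trying new foods"
q3_population = [1] * 700 + [2] * 200 + [3] * 70 + [4] * 30    # "...try new drugs..."

# The same raw answer lands in very different places on the two items.
z_q1_always = z_score(4, q1_population)
z_q3_always = z_score(4, q3_population)
z_q1_never = z_score(1, q1_population)
z_q3_never = z_score(1, q3_population)

print(round(z_q1_always, 2), round(z_q3_always, 2))
print(round(z_q1_never, 2), round(z_q3_never, 2))
```

With these made-up distributions, “Always true” on Q3 is far more unusual than “Always true” on Q1, while “Never true” on Q1 is far more unusual than on Q3, so identical raw sums can hide very different profiles.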

Solution

A perfect task for machines

Once I realized that I could not solve this task with Google Sheets and the manual method, I slowly started to panic. Even if I accidentally found the formula for one subscale, there were a dozen more dimensions, each with a different scoring. I had underestimated the task; now I was facing a shameful failure and going back to delivering food on my bike…

I can build a prediction model with machine learning! …Which I had never ever done before! 😂

Why did I take the risk? Well, even though I had never built a machine learning model before, I was interested in the topic and understood the very basics of the technology. Also:

I have a lot of structured data, which seems good for training and easy to prepare.
I don't need to understand the exact scoring, just replicate it.
I don’t need perfect predictions, just almost perfect ones, because the scoring system uses a small number of categories for the job seekers.

So, I took a deep breath and built a very underoptimized prototype with SPSS, basically with the default settings. Here is the syntax:

USE ALL. 
COMPUTE filter_$=(name="Sofskill_1"). 
VARIABLE LABELS filter_$ 'name="Sofskill_1" (FILTER)'. 
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'. 
FORMATS filter_$ (f1.0). 
FILTER BY filter_$. 
EXECUTE. 
*Multilayer Perceptron Network.
MLP result (MLEVEL=S) BY answer_num_1 answer_num_2 answer_num_4 answer_num_3 answer_num_5 answer_num_6 answer_num_7 answer_num_8 answer_num_9 answer_num_10 
/PARTITION TRAINING=7 TESTING=3 HOLDOUT=0 
/ARCHITECTURE AUTOMATIC=YES (MINUNITS=1 MAXUNITS=50) 
/CRITERIA TRAINING=BATCH OPTIMIZATION=SCALEDCONJUGATE LAMBDAINITIAL=0.0000005 SIGMAINITIAL=0.00005 INTERVALCENTER=0 INTERVALOFFSET=0.5 MEMSIZE=1000 
/PRINT CPS NETWORKINFO SUMMARY CLASSIFICATION 
/PLOT NETWORK 
/SAVE PREDVAL 
/STOPPINGRULES ERRORSTEPS= 1 (DATA=AUTO) TRAININGTIMER=ON (MAXTIME=15) MAXEPOCHS=AUTO ERRORCHANGE=1.0E-4 ERRORRATIO=0.001 
/MISSING USERMISSING=EXCLUDE .

softskill_scoring/spss_perceptron.sav at 4ac90482a87cbd31f7a2a6c1822004ac96274348 · k0-ba/softskill_scoring


And here are the results:

On the first try, I had 89%+ accuracy! 😂

Then I called the best data scientist friend I know (Viktor Toth ❤ ) for a quick consultation to help me build the same, very basic perceptron with Python, only for one soft skill dataset.

I hoped that I could figure out the rest by myself, because I had read the first 2 chapters of “Automate the Boring Stuff with Python.”

The overconfidence train has no stops.

Writing example code for one soft-skill dataset took Viktor less than 30 minutes.
After this, it took me days to figure out how the basics of the code worked, and to clean and slice the data properly for all (20+) separate soft-skill datasets.
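For readers who want a feel for what such a script looks like, here is a minimal sketch with scikit-learn. It is not Viktor's code: the data is synthetic (random answers and a made-up weighted score), the 70/30 split simply mirrors the SPSS prototype, and the layer size is an arbitrary choice.

```python
import pickle

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for one soft-skill dataset: 10 answers on a 1..4 scale
# and a made-up weighted score scaled to the 0..1 range.
X = rng.integers(1, 5, size=(500, 10)).astype(float)
weights = rng.uniform(0.5, 1.5, size=10)
y = (X - 1) @ weights / (3 * weights.sum())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A very basic multilayer perceptron with one hidden layer.
model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print("R^2 on the test split:", model.score(X_test, y_test))

# Serialize the trained model; in a real run this would be one .pickle file per soft skill.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
predictions = restored.predict(X_test)
```

Repeating this for every soft skill is then mostly a matter of filtering the data by skill name before the split.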

Then, after a couple of days, I finally succeeded. OK, now what? How do I measure my model’s performance?

Model performance measurement

Intuitively, I first used the absolute difference between the “predicted value” and the “result” of the API test. After researching the very basics of ML model error measurement, I switched to R².

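import numpy as np

# Evaluate the trained model with R^2 on the held-out test data.
# test_data, model and name_filter come from earlier in the script;
# r2 is assumed to be a list created before the loop over soft skills.
Xtest = np.array(test_data[[f'answer_num_{n}' for n in range(1, 16)]])
ytest = np.array(test_data['result'])
score = model.score(Xtest, ytest)
print('R^2 score of the trained model ' + name_filter + ' : ', score)
r2.append((name_filter, score))  # one (soft skill, R^2) pair per model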

Then I tweaked the model parameters with a primitive but effective approach: I made a loop that trained and evaluated the model again and again, changing the perceptron’s parameters a bit in every iteration.

I have to admit that I had only superficial knowledge about hidden layers or layer size, but the model improved by a few percent, so… ¯\_ (ツ)_/¯.
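A loop like that might look roughly as follows. This is a sketch on synthetic data with an arbitrary list of candidate layer sizes, not the original tuning code. One caveat worth stating: picking the winner on the same split that reports the final score flatters the result a little, so a separate validation split would be cleaner.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Synthetic placeholder data: 6 answers on a 1..4 scale, target scaled to 0..1.
X = rng.integers(1, 5, size=(400, 6)).astype(float)
y = (X - 1).mean(axis=1) / 3

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Candidate hidden-layer layouts to try (arbitrary choices for illustration).
candidates = [(5,), (20,), (20, 10)]

scores = {}
for hidden in candidates:
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=1000, random_state=1)
    model.fit(X_train, y_train)
    scores[hidden] = model.score(X_test, y_test)  # R^2 for this layout

best = max(scores, key=scores.get)
print("best layout:", best, "R^2:", scores[best])
```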

…so that’s it!

I will send the prediction part of the code, well commented.

And I also include all the .pickle files, one trained for each soft skill.

Here is my face when my code runs for the first time the way I think it should: training the models, generating the .pickle files, and printing out the R² like a Swiss factory

Then the client deploys this code on their own servers instead of the old API calls, and TADA. Right?

Nooooooo, the project ain’t over yet! 😈

Understand the business needs better

Long story short: The client wasn’t familiar with backend-side Python implementation.

Neither was I.

There was an epic call with the CTO. Top quotes:

“I will text to the sysadmin, if there any opportunity to install the piton to the server…”

“(…) its not a vain that everybody dislikes piton”

“Let's suppose it works. But after, you should help me with this piton stuff .. handle the data to the PHP!”

A̵t̵ ̵t̵h̵i̵s̵ ̵p̵o̵i̵n̵t̵,̵ ̵I̵ ̵w̵a̵s̵ ̵s̵c̵r̵e̵a̵m̵i̵n̵g̵ ̵i̵n̵s̵i̵d̵e̵:̵ ̵̵

”̵I̵’̵m̵ ̵j̵u̵s̵t̵ ̵a̵ ̵n̵e̵r̵d̵y̵ ̵p̵s̵y̵c̵h̵o̵l̵o̵g̵i̵s̵t̵,̵ ̵h̵i̵r̵e̵d̵ ̵f̵o̵r̵ ̵d̵e̵c̵y̵p̵h̵e̵r̵ ̵a̵ ̵s̵o̵f̵t̵ ̵s̵k̵i̵l̵l̵ ̵t̵e̵s̵t̵’̵s̵ ̵s̵c̵o̵r̵i̵n̵g̵. N̵o̵t̵ ̵a̵ ̵d̵e̵v̵e̵l̵o̵p̵e̵r̵!̵ P̵l̵e̵a̵s̵e̵,̵ ̵g̵i̵v̵e̵ ̵m̵e̵ ̵a̵ ̵b̵r̵a̵k̵e̵!̵😭̵”

…*clears throat*…

At this point, I realized:

To finish this project, I had to adapt to the client’s tech stack and better understand the client's business problem and motives.

After some quick research, I made a video presentation about the model and two possible solution briefs (A and B) for a potential backend consultant. My goal was to validate this plan with somebody who has “developer” in their job title and is familiar with Python and PHP servers.

PLAN A: The client would have a separate Flask / Django server replacing the old API. At the end of the day, they would own the whole scoring algorithm. Also, they had used an API for the scoring (until they lost access to it), so this solution seemed smoother in terms of implementation.

PLAN B: The client would have a JS file for every model [yaaayks] embedded in the website, which would be accessible to everyone, but even easier to implement.

To my biggest surprise, they chose PLAN B.

(After a little digging, it turned out that the startup was on its last legs and starting to run out of resources. I guess this was their last chance before the investors turned off the tap. With that in mind, choosing Plan B is much more understandable.)

Porters are the Google Translate for coders

The next step was a quick consultation with Viktor, who mentioned the word “porter”.

(Porters are libraries that… translate code from one language to another.)

So I dug deep into GitHub. Using this scikit-learn porter, in the end I had a bunch of ugly, big, fat JS files instead of .pickle files. Huuuh.

Now I can send those JS files to the client, and they can implement them. We are finished, right?
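What a ported file has to reproduce is, in essence, the network's forward pass over its stored weights. The sketch below is not the porter library; it only illustrates the idea in Python by exporting the weights of a small scikit-learn MLPRegressor (trained on synthetic data) and recomputing the predictions with plain array math, assuming ReLU hidden layers and an identity output, which are the defaults for this model.

```python
import json

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

# Synthetic placeholder data: 6 answers on a 1..4 scale, target scaled to 0..1.
X = rng.integers(1, 5, size=(300, 6)).astype(float)
y = (X - 1).mean(axis=1) / 3

model = MLPRegressor(hidden_layer_sizes=(8,), activation="relu", max_iter=500, random_state=2)
model.fit(X, y)

# The learned parameters are plain arrays, so they can be exported to any language.
weights_json = json.dumps({
    "coefs": [w.tolist() for w in model.coefs_],
    "intercepts": [b.tolist() for b in model.intercepts_],
})


def forward(x, coefs, intercepts):
    """Plain-array forward pass: ReLU on hidden layers, identity on the output layer."""
    a = np.asarray(x, dtype=float)
    for i, (w, b) in enumerate(zip(coefs, intercepts)):
        a = a @ np.asarray(w) + np.asarray(b)
        if i < len(coefs) - 1:
            a = np.maximum(a, 0.0)
    return a.ravel()


params = json.loads(weights_json)
manual = forward(X[:5], params["coefs"], params["intercepts"])
print(np.allclose(manual, model.predict(X[:5])))
```

Running the same comparison between the original model and the ported code on a handful of rows is a cheap way to check that a port is faithful.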

Wooo-woo-woo! Not that fast!

Understand people better

I was about to celebrate. But then a long and coldhearted email arrived, reporting an error rate of more than 20–30%, in some cases even worse. This was funny because my R² scores were between 96–98%.

	test sample #	MAX error	AVG error	exact match	category errors #	category errors %
my test 😊	3465	0.46	0.059	1665	80	2.4%
client’s test 😱	306	1.47	0.57	4	268	87%
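The columns of a comparison like this can be computed in a few lines. The sketch below is only illustrative: the numbers are made up, and the four equal-width categories on a 0..1 scale are a hypothetical stand-in, since the client's real category cut points are not described here.

```python
def to_category(score, n_categories=4):
    """Map a 0..1 score to one of n equal-width categories (hypothetical cut points)."""
    return min(int(score * n_categories), n_categories - 1)


def error_report(expected, predicted, tol=1e-9):
    """Summarize prediction errors in the same columns as the comparison table."""
    errors = [abs(e - p) for e, p in zip(expected, predicted)]
    category_errors = sum(
        to_category(e) != to_category(p) for e, p in zip(expected, predicted)
    )
    return {
        "samples": len(errors),
        "max_error": max(errors),
        "avg_error": sum(errors) / len(errors),
        "exact_match": sum(err <= tol for err in errors),
        "category_errors": category_errors,
        "category_error_pct": 100 * category_errors / len(errors),
    }


# Made-up scores for illustration only.
expected = [0.375, 0.625, 0.875, 0.10]
predicted = [0.375, 0.60, 0.70, 0.12]

report = error_report(expected, predicted)
print(report)
```

Only the category columns matter for the job seekers in the end, which is why a small average error can coexist with very few exact matches.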

I was shocked at first: Let's say I had a light-yellow belt in Python, but almost zero experience with JavaScript.

So, although testing didn't seem like a complicated task, I wasn’t able to test all models one by one, generate outputs, and measure error rates in JS…

…and I had 1.5 days left until the deadline.

This meant I wasn’t sure that the porting process from Python to JS was successful.
(Also, I was scared: what if the model was indeed wrong, because I had overfitted it?!)

The error (if it was on my side) could have been at any step.

After searching for the nonexistent error for days (and abusing Viktor’s time and attention), I found out that nothing was wrong with my code, not even with the JS, but their test sample was very small compared to mine (306 vs. 3465).

I asked them for another test on a bigger sample, because it was almost impossible for those error rates to be valid (assuming the porting process was successful)… and the response was:

“it takes days to perform the test with that sample size”.

At this point, I harbored dubious suspicions (maybe a bit late). My Windows 10 tablet trains a model on thousands of data points in 8 minutes, yet testing takes days. Very interesting.🤔

Remember the CTO who asked for an MLP model in PHP? He somehow managed to fill the test set with outliers, which were a weak spot of my model, e.g.:

[1,1,1,1,1,1] [1,1,1,1,1,2] [1,1,1,1,1,3] … [6,6,6,6,6,6] [6,6,6,6,6,5] [6,6,6,6,6,4]

For some reason, the neural net wasn’t good at this kind of extreme case, and for some reason, these were selected for the test set to evaluate my models. Very-very interesting.🤔
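A quick way to quantify such a skewed test set is to measure how many of its rows are (almost) uniform answer patterns and compare that share with the training data. The helper names and the sample rows below are hypothetical; this is a sketch of the check, not the analysis I actually sent.

```python
from collections import Counter


def is_near_uniform(answers, max_deviating=1):
    """True if all but at most max_deviating answers share the same value."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return len(answers) - most_common_count <= max_deviating


def share_near_uniform(rows):
    """Fraction of rows that are (almost) one repeated answer."""
    return sum(is_near_uniform(r) for r in rows) / len(rows)


# Made-up rows resembling the extreme patterns above, plus one ordinary row.
client_like_rows = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 2],
    [6, 6, 6, 6, 6, 5],
    [2, 5, 3, 1, 6, 4],
]

share = share_near_uniform(client_like_rows)
print(share)  # 0.75
```

If the share is far higher in one test set than in the data the model was trained on, the two error rates are not measuring the same thing.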

So I wrote a step-by-step deduction about these “anomalies”, in a very careful and easily digestible way. With diagrams, without pointing fingers.

I sent this to the CEO. One day later the deadline was extended, and the test they ran on the bigger dataset showed 96% results. Tadaaa!

Happy end

The client was very satisfied. So was I, because this was my first coding-heavy project, and also the first time I used machine learning to solve a real business problem.

But the best part of it was that I “got back” my professional self-confidence (although I really enjoyed delivering food by bicycle for 3 days); this experience put me back on my original life track.

(Viktor didn't want to accept any money, so I purchased a big chunk of his Steam wishlist.)

What did I learn from this project?

coding needs empathy
coding is addictive
solving problems you never solved is addictive
asking for money for solving problems you never solved…is not fully responsible, but extremely fun
technical skills & stack < domain & business knowledge < good communication
but! technical skills are the basis of it all: without them you can't even imagine what is possible business-wise, or communicate what your needs are