Show our technical findings #93
Replies: 5 comments 1 reply
-
From the readability analysis, real job posts had a balanced readability level: professional but still understandable, with moderate Flesch scores and grade levels. Human-written fake jobs, however, tended to be simpler and more emotionally persuasive, often using more accessible language. In contrast, AI-refined fake posts read smoother and sometimes too polished, with higher grade levels and slightly lower readability, which could be a sign of synthetic or overly optimized language.

From the n-gram analysis, real job posts frequently used concrete, role-specific phrases (e.g., “project management”, “customer service”), while fake posts, especially human-written ones, leaned on general and appealing phrases (like “great opportunity”, “flexible hours”). AI-generated fake posts often used more structured corporate terms and filler language (like “responsible for ensuring”, “strong communication skills”).

Together, the two analyses support the research question on detecting scam or synthetic job posts. Readability metrics suggest that AI and scam posts may deviate from natural, professional human writing by being either too simple (scammy) or too smooth (AI). N-gram patterns highlight linguistic differences, especially overused buzzwords and templated phrasing, which are useful indicators of inauthentic posts.
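For anyone who wants to reproduce the two measurements on a few sample posts, here is a minimal sketch. It is not the code from our notebooks: the function names are made up for illustration, the syllable counter is a rough vowel-group heuristic, and only the two formulas (Flesch Reading Ease and Flesch-Kincaid grade level) are the standard published ones.

```python
import re
from collections import Counter


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing silent "e" as non-syllabic, except in "-le" endings.
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return Flesch Reading Ease and Flesch-Kincaid grade for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


def top_ngrams(texts, n=2, k=10):
    """Count word n-grams across a list of texts and return the k most common."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


if __name__ == "__main__":
    sample = "Great opportunity. Flexible hours. Work from home."
    print(readability(sample))
    print(top_ngrams([sample], n=2, k=3))
```

Higher Reading Ease means easier text and a higher grade means harder text, so the "too simple" and "too smooth" patterns described above would show up at opposite ends of these scores. A dedicated readability library would give more accurate syllable counts than this heuristic.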
-
Most Frequent Words Analysis
Datasets used
For analysis purposes, we assume that all posts in the Aegean dataset were human-written, since it was collected between 2012 and 2014.
The Initial Hypotheses
Before running the notebook, the hypothesis was that real and fake job posts use different words, or the same words with significantly different counts, and that the same holds between human-written and LLM-refined posts. The main purpose of this analysis is to check whether some words are used more frequently in one group than in the others. The most_frequent_words.ipynb notebook looks at the columns that usually make up an overall job description: description, requirements, and benefits.
Points Noticed
For description:
For requirements:
For benefits:
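To make the per-column comparison concrete, here is a rough sketch of one way to measure how similar the most frequent words of two groups are. This is an illustration only: the notebook's actual similarity measure is not stated in this thread, so the Jaccard overlap of the top-k words, the tiny stopword list, and the function names are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the notebook's list).
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in",
    "for", "with", "is", "are", "we", "you", "our",
}


def word_counts(texts):
    """Count lowercase word tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


def top_k_overlap(counts_a, counts_b, k=20):
    """Jaccard overlap between the k most frequent words of two groups."""
    top_a = {word for word, _ in counts_a.most_common(k)}
    top_b = {word for word, _ in counts_b.most_common(k)}
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy texts standing in for one column (e.g. benefits) of each group.
    real = ["Health insurance and paid vacation", "Paid vacation and training"]
    fake = ["Paid weekly and flexible hours", "Health insurance and bonus"]
    print(top_k_overlap(word_counts(real), word_counts(fake), k=5))
```

Running this once per column (description, requirements, benefits) and per pair of groups (real, fake, LLM-refined) gives one overlap score for each comparison, where 1.0 means identical top words and 0.0 means none shared.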
We can see that for the first two columns, real and fake posts are already similar, while the LLM-refined version shifts significantly away from both. The LLM-refined version does tend to mimic real posts, but only to a slight degree. For the last column, the LLM-refined version has a higher similarity with both real and fake posts, mainly because the benefits section usually uses the same words for all types of jobs regardless of context.
Possible Errors
I'd say word frequency does not do the hypotheses full justice: it supports them, but only slightly. I'd suggest shedding more light on n-grams instead, since they could tell much more!
Departments and Salaries Comparison
This uses the same dataset as mentioned above, except for the LLM-refined version, as there's no need for it in this analysis.
The Initial Hypotheses
The question was whether fake jobs tend to focus more on certain departments and industries compared to real jobs, and how the salaries for those departments compare, to test the theory that "fake jobs promise the dream salary". The main purpose of the notebook departments_salaries_comparison.ipynb is to check the features that employers behind both real and fake jobs tend to feed to the LLM in the first place, and to see if there's an actual difference. The columns used are titles, department, industry, and function, since they all relate to the same thing, plus salary range to check the salary corresponding to each cluster. I applied clustering to all four columns together because most of them often contain NaN values and they refer to the same concept, so it's better to group them into a set of categories with an optimal number of clusters.
Points Noticed
Possible Errors
There were a lot of missing salary range values in fake posts. If one has time for it, I'd suggest using synthetic data to better mimic fake jobs and make the comparison fairer.
Final Notes
Fake jobs do not tend to focus on certain departments compared to real jobs; they're everywhere. However, they do promise higher salaries than real jobs for the same roles/clusters.
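As a rough illustration of the salary comparison per cluster, here is a minimal sketch that skips missing salary ranges instead of filling them in. It is not the notebook's code: it assumes each post is a dict with a "low-high" string under salary_range, a 0/1 fraudulent flag, and a cluster label that was already assigned by the clustering step (all three key names are assumptions), and it uses the median of range midpoints as the summary.

```python
import re
from collections import defaultdict
from statistics import median


def salary_midpoint(salary_range):
    """Parse a 'low-high' string into its midpoint; None if missing or malformed."""
    if not isinstance(salary_range, str):
        return None  # covers None and NaN floats from a dataframe export
    match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", salary_range)
    if not match:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) / 2


def median_salary_by_cluster(rows):
    """Median salary midpoint for each (cluster, fraudulent) pair.

    Rows without a parseable salary range are skipped, so sparse groups
    (such as fake posts) rest on fewer observations.
    """
    groups = defaultdict(list)
    for row in rows:
        midpoint = salary_midpoint(row.get("salary_range"))
        if midpoint is not None:
            groups[(row["cluster"], row["fraudulent"])].append(midpoint)
    return {key: median(values) for key, values in groups.items()}


if __name__ == "__main__":
    # Toy rows with made-up values, only to show the expected input shape.
    posts = [
        {"cluster": 0, "fraudulent": 0, "salary_range": "20000-30000"},
        {"cluster": 0, "fraudulent": 1, "salary_range": "50000-70000"},
        {"cluster": 0, "fraudulent": 1, "salary_range": None},
    ]
    for key, value in sorted(median_salary_by_cluster(posts).items()):
        print(key, value)
```

Because rows with missing ranges are dropped, it is worth reporting how many posts each median is based on next to the medians themselves, given how many fake posts have no salary range.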