
Analysis of the Job Market for Data Related Positions
For the code, please visit: https://github.com/aeabdou/Data-Job-Market-Insights-/tree/main
1. Introduction
The field of data science and analytics is continually evolving, with the demand for expertise in this domain witnessing an unprecedented surge. As businesses recognize the power of data-driven decision-making, there is a growing need to understand the dynamics of the job market catering to this domain. Our study delves into the multifaceted aspects of the job market for data science and analysis positions. By analyzing job postings, salary distributions, required qualifications, and geographical distributions, we aim to provide a comprehensive overview of current trends and insights. Moreover, by integrating H1B visa data, we seek to understand the U.S. job landscape for international data science students, highlighting the bridge between domestic and international talent acquisition.
2. Hypotheses
-
Educational qualifications play a significant role in determining the salary bracket of data science and analytics positions.
-
Job locations, especially in tech hubs, have a positive correlation with higher salary offerings.
-
Companies in certain sectors, like Information Technology and Finance, are more likely to sponsor H1B visas for data science roles.
3. Data Collection and Overview
Our primary dataset comes from Kaggle.com and contains job postings collected from the Indeed website, with detailed information about data science jobs.
The dataset covers 4,000 job postings with 20 fields each, which gives us enough detail to study the market along several dimensions.
4. Analysis
4.1 Salary Analysis
Salary Analysis by Job title and degree level
Job titles were grouped into a broader set of title categories. Titles that didn’t fit any category were labeled “Other”.


The plot above shows consistently higher mean salaries for ML/AI and Data Science jobs across the three degree levels, and consistently lower mean salaries for Quantitative and Business Intelligence jobs. This could be investigated further using ANOVA.

-
Degree's Influence on Salary: There's a statistically significant difference in average salaries based on the educational degree someone holds. So, the level of education (BS, MS, PhD) can impact one's salary.
-
Job Title's Influence on Salary: Different job titles or roles also show statistically significant differences in average salaries. This means the type of job someone has plays a role in determining their salary.
-
Interaction between Degree and Job Title: The interaction between the educational degree and job title is also statistically significant. This implies that the influence of one's degree on their salary might vary depending on the job title one holds, and vice versa.
Salary analysis by Job Location and degree level
We also examined where most of these jobs are located. This helps job seekers see where they might move for a job and where companies are hiring the most. The plot below shows the mean salary for each of the ten states by degree level.

ANOVA Results:

-
State's Influence on Salary: There's a statistically significant difference in average salaries among different states. This means that the state in which a job is located plays a crucial role in determining the salary.
-
Degree's Influence on Salary within States: Within individual states, there isn't a statistically significant difference in average salaries based on the educational degree one holds. So, within a particular state, holding different educational degrees (BS, MS, PhD) doesn't necessarily mean one would earn a significantly different salary.
-
Interaction between State and Degree: The relationship between the state and degree is statistically significant, indicating an interaction effect. This suggests that the impact of one's educational degree on salary might vary depending on the state they are in, and the influence of the state on salary might differ based on the educational degree held.
4.2. Sector Analysis
Count of data jobs for each sector

From the plot:
It is clear that the Information Technology sector has the greatest number of data-related jobs, followed by the Business Services, Finance, and Healthcare sectors. The Mining, Agriculture, Travel & Tourism, and Transportation sectors do not appear to hire many data professionals.
The plot below shows the mean salaries for each sector.

The construction sector appears to be the highest paying of all sectors. However, the sector distribution in the previous plot shows that it has a very small number of data-related jobs, so its high mean is likely driven by a handful of high-paying postings rather than by the sector as a whole.
ANOVA Results:

Tukey's Honestly Significant Difference (HSD)
HSD test was done to make pairwise comparisons between the groups. This post-hoc test helps identify which specific groups differ from each other after a significant ANOVA result.
The tables below compare each group with Information Technology. The IT group was chosen as the reference because it has the highest combination of mean salary and job count.
Significant Differences:
The table shows the sectors with statistically significant differences in mean salaries when compared to the "Information Technology" sector. The positive "Mean Difference" values indicate that "Information Technology" has higher average salaries than the sectors listed in "Group 1". Specifically:

Interpretation:
The table demonstrates that the "Consumer Services & Retail", "Financial & Legal Services", "Health & Pharmaceuticals", and "Public & Non-Profit" sectors have mean salaries that are significantly lower than the "Information Technology" sector. Specifically:
-
The "Consumer Services & Retail" sector's mean salary is lower by approximately $15,540.
-
The "Financial & Legal Services" sector's mean salary is reduced by about $14,840.
-
The "Health & Pharmaceuticals" sector's mean salary is down by roughly $14,450.
-
The "Public & Non-Profit" sector's mean salary is diminished by around $21,738.
The consistent p-values of 0.001 across these sectors emphasize the statistical significance of these disparities, indicating that these differences are very unlikely to be due to random chance.
Non-significant Differences:
Sectors whose mean salaries do not differ significantly from the "Information Technology" sector are not included in the table. Despite the apparent differences in mean salaries, we don't have enough statistical evidence to say that these disparities are genuine and not just due to random variation. The sectors in this category include "Agriculture", "Energy & Mining", "Entertainment & Media", "Entertainment, Arts & Tourism", "Manufacturing & Construction", and "Transportation & Logistics". The higher p-values for these sectors, ranging from 0.4020 to 0.9000, reflect the lack of statistical significance.
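The pairwise comparisons against a reference sector can be sketched as follows. This is an illustration only, assuming SciPy's `tukey_hsd` (available in SciPy 1.8 and later) and synthetic salaries in thousands of dollars; the tool used for the report's tables is not stated, and the group sizes and means below are invented.

```python
import numpy as np
from scipy.stats import tukey_hsd

# Synthetic salaries per sector; sizes and means are made up.
rng = np.random.default_rng(1)
sectors = {
    "Information Technology": rng.normal(110, 10, size=40),
    "Public & Non-Profit": rng.normal(90, 10, size=40),
    "Agriculture": rng.normal(108, 10, size=5),
}
names = list(sectors)
result = tukey_hsd(*sectors.values())

# Report every sector against the reference group.
ref = names.index("Information Technology")
for i, name in enumerate(names):
    if i == ref:
        continue
    diff = result.statistic[ref, i]  # mean(reference) - mean(other)
    p_value = result.pvalue[ref, i]
    print(f"IT vs {name}: mean difference = {diff:.1f}, p = {p_value:.4f}")
```

A sector with only a few postings has a wide confidence interval, so a non-significant result for it indicates a lack of evidence rather than proof that its salaries match the reference sector.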
4.3 Most used Technologies Analysis


Technologies mentioned vs job Title

5. Salary Prediction
A neural network was used to predict a job's salary category based on other job details. This section is an explanation of how the network was built and how well it works.
Feature Engineering and Data Preparation:
1. Job Title Classification: To make the job titles more manageable and to categorize them into broader groups, we grouped them based on certain keywords. For instance:
-
Titles containing words like 'analytic', 'analysis', or 'data analyst' were labeled as 'Analytics'.
-
Titles with 'machine learning', 'ml', or 'ai' were categorized as 'ML/AI'.
-
'Senior' or 'sr.' keywords led to the 'Senior Jobs' label. ... and so on for other titles.
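The keyword rules above can be sketched as an ordered list of patterns where the first match wins. Only the three rules named in the text are included. The rule order, the regular expressions, the `TITLE_RULES` name, and the `categorize_title` helper are assumptions made for illustration. Word boundaries are used for short keywords such as 'ml' and 'ai' so that they do not match inside longer words like "maintenance".

```python
import re

# Ordered (label, pattern) rules; the first matching rule decides the label.
# The order and the exact patterns are assumptions, not the report's code.
TITLE_RULES = [
    ("Analytics", re.compile(r"analytic|analysis|data analyst")),
    ("ML/AI", re.compile(r"machine learning|\bml\b|\bai\b")),
    ("Senior Jobs", re.compile(r"\bsenior\b|\bsr\b")),
]


def categorize_title(title: str) -> str:
    """Map a raw job title to a broad category, or 'Other' if no rule matches."""
    lowered = title.lower()
    for label, pattern in TITLE_RULES:
        if pattern.search(lowered):
            return label
    return "Other"


for raw in ["Data Analyst", "AI Research Scientist", "Sr. Software Engineer",
            "Maintenance Planner"]:
    print(raw, "->", categorize_title(raw))
```

Because the first match wins, a title such as "Senior Data Analyst" lands in 'Analytics' under this ordering; a different order would send it to 'Senior Jobs', so the order is a modeling choice worth documenting.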
2. Salary Range Categorization (Y):
The salary data, represented as 'mean_salary', was segmented into five distinct categories to simplify the prediction process:
-
Below 70: Category 0
-
70 to 89: Category 1
-
90 to 119: Category 2
-
120 to 149: Category 3
-
150 and above: Category 4
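One way to produce these five categories is `pandas.cut` with left-closed bins. This is a sketch under two assumptions: that 'mean_salary' is expressed in thousands of dollars, and that each range runs up to, but does not include, the start of the next one (so 89.5 falls in Category 1). The `salary_category` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Example salaries chosen to sit on and around the bin edges.
df = pd.DataFrame(
    {"mean_salary": [55.0, 70.0, 89.5, 90.0, 119.0, 120.0, 149.9, 150.0, 210.0]}
)

# right=False makes each bin include its left edge and exclude its right edge:
# [-inf, 70), [70, 90), [90, 120), [120, 150), [150, inf)
bins = [-np.inf, 70, 90, 120, 150, np.inf]
df["salary_category"] = pd.cut(
    df["mean_salary"], bins=bins, right=False, labels=False
)
print(df)
```

With `labels=False`, `pd.cut` returns the integer bin index directly, which matches the 0 to 4 category codes listed above.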
3. Sector Grouping:
The sectors were further grouped into broader categories to simplify the analysis. This was based on the nature of the sectors. For example:
-
'Information Technology' and 'Telecommunications' were grouped under 'Information Technology'.
-
'Insurance', 'Finance', 'Accounting & Legal', and 'Real Estate' were collectively termed 'Financial & Legal Services'. ... and so on for other sectors.
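The grouping above amounts to a lookup table from raw sector names to broader groups. A minimal sketch, covering only the two groups named in the text: the `SECTOR_GROUPS` name is hypothetical, 'Biotech & Pharmaceuticals' is an invented raw value used to show the fallback, and leaving unmapped sectors unchanged is an assumption.

```python
import pandas as pd

# Raw sector -> broader group. Only the mappings named in the text are listed.
SECTOR_GROUPS = {
    "Information Technology": "Information Technology",
    "Telecommunications": "Information Technology",
    "Insurance": "Financial & Legal Services",
    "Finance": "Financial & Legal Services",
    "Accounting & Legal": "Financial & Legal Services",
    "Real Estate": "Financial & Legal Services",
}

sectors = pd.Series(
    ["Telecommunications", "Insurance", "Real Estate", "Biotech & Pharmaceuticals"]
)

# map() yields NaN for sectors missing from the table; fillna() restores
# the original value for those rows.
grouped = sectors.map(SECTOR_GROUPS).fillna(sectors)
print(grouped.tolist())
```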
Data Cleaning:
1. Dropping Unneeded Columns:
We started by removing columns that weren't essential for our analysis. This included columns like 'Revenue', 'Job Description', 'Company Name', 'Salary_Estimate', and several others. By removing these columns, we could focus on the most relevant data for our neural network model.
2. Handling Missing Values:
In the dataset, missing values were represented as -1 or '-1'. We replaced these with NaN (Not a Number), which is the standard way to denote missing data in pandas and NumPy.
After this change, we looked at the percentage of missing values in each column. Some columns like 'Rating' and 'Founded' had missing values, while others like 'Size' and 'Sector_Int' had even more missing data.
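The replacement and the per-column missing percentage can be sketched in a few lines of pandas. The small frame below is invented for illustration; only the column names and the -1 / '-1' placeholders come from the text.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that uses both placeholder forms.
df = pd.DataFrame(
    {
        "Rating": [3.5, -1, 4.1, 3.9],
        "Founded": [1999, 2010, -1, -1],
        "Size": ["51 to 200 employees", "-1", "-1", "-1"],
    }
)

# Treat both the numeric and the string placeholder as missing.
df = df.replace([-1, "-1"], np.nan)

# Share of missing values per column, in percent.
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct)
```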
3. Filling Missing Values:
-
For numerical columns like 'Rating' and 'Founded', we filled the missing values with the average value of that column. This way, we weren't adding any new outlier values but rather keeping the data consistent.
-
For categorical columns like 'Size' and 'Sector_Int', we initially filled the missing values with the label 'Unknown'. However, for the 'Sector_Int' column, we later replaced 'Unknown' with the average value of the column after rounding it to ensure it fits the existing category values.
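The filling strategy above can be sketched as follows. The data is invented, and the sketch simplifies one step: it fills 'Sector_Int' directly with the rounded column mean instead of passing through an intermediate 'Unknown' label.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column.
df = pd.DataFrame(
    {
        "Rating": [3.0, np.nan, 4.0, 5.0],
        "Founded": [1990.0, 2010.0, np.nan, 2000.0],
        "Size": ["51 to 200 employees", None, "10000+ employees", "1 to 50 employees"],
        "Sector_Int": [1.0, np.nan, 4.0, 6.0],
    }
)

# Numerical columns: fill with the column mean.
for col in ["Rating", "Founded"]:
    df[col] = df[col].fillna(df[col].mean())

# Categorical text column: fill with an explicit 'Unknown' label.
df["Size"] = df["Size"].fillna("Unknown")

# Integer-coded sector: fill with the rounded mean so the value is a valid code.
sector_fill = int(round(df["Sector_Int"].mean()))
df["Sector_Int"] = df["Sector_Int"].fillna(sector_fill).astype(int)

print(df)
```

One caveat on the design: the mean of integer category codes depends on how the codes happen to be numbered, so the imputed sector is somewhat arbitrary. Filling with the most frequent code would be a possible alternative.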
4. Final Check:
After all these cleaning steps, we checked again for missing values. This helped confirm that our cleaning steps worked and that our data was now ready for further analysis and modeling.
With the data cleaned and features engineered, it was now prepped for input into our neural network model. This process ensures that our model gets the best quality data, which can significantly improve its predictions.
Building the Neural Network:
1. Data Splitting:
Before feeding data into the neural network, we split our dataset into two parts:
-
Training set: This data is used to train the neural network. It helps the model learn the patterns in the data.
-
Test set: This data is kept separate and is used later to test how well our model is working.
We used 80% of the data for training and kept 20% for testing.
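An 80/20 split of this kind is commonly done with scikit-learn's `train_test_split`. The sketch below assumes that library and uses random stand-in features and labels; the `random_state` value is arbitrary and only makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the feature matrix and the salary-category labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.integers(0, 5, size=100)

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```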
2. Neural Network Architecture:

A multi-layer neural network was built using the following structure:
-
First layer: 500 neurons with a 'relu' activation function. 'ReLU' stands for Rectified Linear Unit, a popular activation because it is cheap to compute and helps the model learn non-linear patterns.
-
Second layer: 200 neurons with a 'relu' activation function.
-
Third layer: 80 neurons with a 'relu' activation function.
-
Output layer: 5 neurons with a 'linear' activation function. We have 5 neurons because we're predicting 5 different salary categories.
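To get a feel for the size of this architecture, the number of trainable parameters can be computed by hand: each fully connected layer has one weight per input-output pair plus one bias per neuron. The number of input features is not stated in the text, so `n_features = 30` below is a hypothetical value, and `dense_param_count` is a helper written for this illustration.

```python
# Layer widths taken from the list above: three hidden layers and the output.
layer_sizes = [500, 200, 80, 5]


def dense_param_count(n_inputs: int, sizes: list[int]) -> int:
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in sizes:
        total += fan_in * units + units  # weight matrix + bias vector
        fan_in = units
    return total


n_features = 30  # hypothetical number of input features
print(dense_param_count(n_features, layer_sizes))  # → 132185
```

With about 4,000 postings and 80% used for training, a network of this size has far more parameters than training examples, which makes evaluation on held-out data especially important.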
3. Model Compilation:
After setting up the structure, we had to decide how the model should learn. For this, we chose:
-
A loss function called 'SparseCategoricalCrossentropy'. This function measures how well the model's predictions match the real data, especially when the data has multiple categories like ours.
-
The 'Adam' optimizer with a learning rate of 0.001. This optimizer helps update the model based on the data it sees, making it learn better over time.
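Because the output layer uses a 'linear' activation, its five outputs are raw scores (logits), not probabilities. Sparse categorical cross-entropy then has to apply a softmax before taking the log-probability of the true class. In Keras this is requested with `from_logits=True` (the default is `False`); whether that flag was set is not stated here. The NumPy sketch below shows the computation itself, with a hypothetical helper name.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels) -> float:
    """Mean cross-entropy between integer labels and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtract the row maximum for numerical stability; softmax is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to each row's true class.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Equal scores for all 5 categories give probability 1/5 each, so loss = ln(5).
uniform_loss = sparse_categorical_crossentropy_from_logits(np.zeros((2, 5)), [0, 3])

# A large score on the correct category drives the loss towards zero.
confident_loss = sparse_categorical_crossentropy_from_logits(
    [[10.0, 0.0, 0.0, 0.0, 0.0]], [0]
)
print(uniform_loss, confident_loss)
```

The uniform case gives ln(5), roughly 1.609, which is a useful baseline: a training loss near that value means the model is doing no better than guessing among the five categories.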
4. Training the Model:
With everything set up, we started training our neural network. We passed the training data through it 500 times (called epochs). With each pass, the model got better at making predictions.
Once the model is trained, it can be used to predict salary categories for new job listings. It's important to note that neural networks like this one can be very powerful, but they also need a lot of data and fine-tuning to work well.
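Putting the architecture, compilation, and training steps together, a Keras version might look like the sketch below. It assumes TensorFlow's bundled Keras API, uses random stand-in data, and passes `from_logits=True` to match the linear output layer, which is an assumption about the original setup. The `build_model` helper is hypothetical, and the epoch count is cut to 3 so the example runs quickly; the report used 500.

```python
import numpy as np
import tensorflow as tf


def build_model(n_features: int) -> tf.keras.Model:
    """Dense network with the layer sizes described in the report."""
    model = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(n_features,)),
            tf.keras.layers.Dense(500, activation="relu"),
            tf.keras.layers.Dense(200, activation="relu"),
            tf.keras.layers.Dense(80, activation="relu"),
            tf.keras.layers.Dense(5, activation="linear"),  # one score per category
        ]
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        # from_logits=True because the output layer emits raw scores.
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model


# Random stand-in data: 64 rows, 12 features, labels 0..4.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 12)).astype("float32")
y_train = rng.integers(0, 5, size=64)

model = build_model(n_features=X_train.shape[1])
history = model.fit(X_train, y_train, epochs=3, verbose=0)

# Predicted category = index of the largest output score.
predicted = model.predict(X_train, verbose=0).argmax(axis=1)
```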
Assessing Neural Network Performance:
1. Predicting on Training Data:
We used our trained neural network to make predictions on the training data. This means we're asking the model to guess the salary category for job listings it has already seen during training, so the results below are likely more optimistic than they would be on unseen data. Future drafts will include confusion matrices for the validation and test sets.
Confusion Matrix:

Analyzing the confusion matrix:
-
Category 3 (120 to 149): With 392 correct predictions and relatively fewer misclassifications across other categories, the model performed best in predicting jobs with salaries in this range.
-
Category 0 (Below 70): The model had 331 correct predictions for this category, with misclassifications mostly leaning towards the adjacent salary range (Category 1).
-
Category 1 (70 to 89): With 254 correct predictions, the model performed decently for this category, although it had some confusion with Category 0.
-
Category 2 (90 to 119): The model correctly predicted 325 jobs in this salary range. However, it showed a spread of misclassifications across all other categories, indicating some level of confusion.
-
Category 4 (150 and above): The model had 239 correct predictions but showed notable misclassifications, especially towards Category 3 (the adjacent lower salary range). This indicates that the model found it slightly more challenging to distinguish the highest salary category from others.
In summary, the model was most effective at predicting Category 3 (120 to 149) and least effective at Category 4 (150 and above), with performance for the other categories falling in between.
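A confusion matrix and per-category recall can be computed with a few lines of NumPy. The labels below are invented, and the `confusion_matrix` helper is written for this illustration and is not the code behind the matrix discussed above.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes: int) -> np.ndarray:
    """Rows are true categories, columns are predicted categories."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # add.at accumulates correctly even when an index pair repeats.
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix


# Made-up labels for eight postings.
y_true = np.array([0, 0, 1, 2, 3, 3, 4, 4])
y_pred = np.array([0, 1, 1, 2, 3, 3, 3, 4])

cm = confusion_matrix(y_true, y_pred, n_classes=5)

# Recall per category: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(recall)
```

Raw counts of correct predictions are only comparable across categories when each category has the same number of postings. Dividing by the row totals, as in `recall` above, gives a fairer basis for ranking categories.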
6. H1B Sponsorship Data Analysis
We also looked at data about hiring from other countries. This helps us understand if companies are looking outside the US for data science experts.
It's important to note that this section is still under development. Comprehensive data spanning multiple years is essential for a more nuanced understanding of U.S. sponsorship trends for data-related roles.
Basic Visualizations:
-
The plot below shows the ranking of the top ten states that sponsored data professionals.

-
The plot below shows the ranking of the bottom ten states that sponsored data professionals.

-
The plot below shows the ranking of the top ten cities that sponsored data professionals.

-
The plot below shows the ranking of the top ten companies that sponsored data professionals.

7. Conclusion
Our analysis offers a detailed snapshot of the current state of the job market for data science and analytics positions. From examining how educational qualifications relate to salary levels to identifying the geographical hotspots of data roles, we've gathered valuable insights. Our preliminary exploration of H1B visa data gives a first look at where international data talent is sponsored, although multi-year data is needed before drawing conclusions about trends. While our neural network-based salary predictions pave the way for future enhancements, our study serves as a foundational step towards a more inclusive and detailed understanding of the data science job landscape in the U.S.
8. Recommendations
-
For Job Seekers: Tailoring skillsets to match industry demands can lead to better job prospects. Being open to relocation, especially to tech hubs, can significantly enhance salary prospects.
-
For Employers: Investing in training programs can bridge the skill gap, allowing companies to mold talent as per their specific needs. Moreover, looking beyond borders and considering international talent can address the growing demand for data expertise.
-
For Educational Institutions: Introducing more practical, industry-aligned courses can better prepare students for the real-world challenges of data roles.
9. Future Steps
-
Neural Network Enhancement: While our current neural network provides valuable insights, there's room for improvement. By refining the architecture and incorporating more diverse data, we aim to enhance its predictive accuracy, especially for new, unseen data.
-
Expanding H1B Data Analysis: To offer a more comprehensive view of the U.S. job landscape for international data science students, we intend to source and analyze H1B visa data from multiple years. This will allow us to identify trends and make more accurate forecasts.
10. Appendix
