Python Script for Scraping a Public Job Board

Overview

This Python script scrapes job postings from a public job board and prepares the data for a machine learning project. It extracts each posting's title, link, description, and posting date from the job board's HTML.

Features

Scraping Job Postings: Uses the BeautifulSoup library to parse HTML and extract job details.
Data Cleaning: Cleans the extracted text, for example by removing HTML tags and normalizing whitespace.
Data Structuring: Organizes the extracted data into a pandas DataFrame.
CSV Export: Saves the scraped data to a CSV file for further analysis or machine learning model training.
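
As an illustration of the data cleaning feature, here is a minimal sketch of one way to strip tags and tidy whitespace. The function name clean_text and the sample fragment are hypothetical and are not taken from the project's notebooks; the actual cleaning steps may differ.

```python
import re

from bs4 import BeautifulSoup


def clean_text(raw_html: str) -> str:
    """Strip HTML tags and collapse whitespace (hypothetical helper)."""
    # get_text() drops the markup; the separator keeps text from
    # adjacent tags from running together.
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Made-up fragment, used only to show the effect of the helper.
cleaned = clean_text("<p>Senior   Data\nEngineer</p><br><b>Remote</b>")
print(cleaned)  # Senior Data Engineer Remote
```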

Character Limitation

The text in the 'Content' column is limited to a maximum of 10,000 characters per entry.
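
One way such a limit could be applied with pandas is sketched below. This assumes the limit is enforced by truncating each entry, which the README does not state explicitly; the helper name limit_content and the sample rows are hypothetical.

```python
import pandas as pd

MAX_CONTENT_CHARS = 10_000  # the limit stated in this README


def limit_content(df: pd.DataFrame, max_chars: int = MAX_CONTENT_CHARS) -> pd.DataFrame:
    """Return a copy whose 'Content' entries keep at most max_chars characters."""
    limited = df.copy()
    # .str.slice(0, n) keeps the first n characters of each string
    # and leaves shorter strings unchanged.
    limited["Content"] = limited["Content"].str.slice(0, max_chars)
    return limited


# Made-up rows: one short entry and one that exceeds the limit.
sample = pd.DataFrame({"Content": ["short description", "x" * 12_000]})
limited = limit_content(sample)
print(limited["Content"].str.len().tolist())  # [17, 10000]
```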

Libraries Used

BeautifulSoup: For parsing HTML and navigating the DOM tree of the job board.
pandas: For data manipulation and structuring the scraped data into a DataFrame.
requests: For making HTTP requests to fetch the HTML content of the job board.

Workflow

Fetch HTML Content: Uses the requests library to retrieve the HTML content of the job board page.
Parse HTML: Uses BeautifulSoup to parse the HTML content and locate the relevant job details.
Data Extraction: Extracts job titles, links, descriptions, and posting dates from the parsed HTML.
Data Cleaning: Cleans the extracted text by removing unnecessary HTML tags, newline characters, and extra spaces.
Data Structuring: Constructs a pandas DataFrame to store the cleaned and structured job data.
CSV Export: Saves the DataFrame to a CSV file so the data is ready for analysis or machine learning tasks.
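
The workflow above could be sketched roughly as follows. This is not the project's actual code: the CSS selectors, the sample HTML, the function names, and every column name except 'Content' are assumptions made for illustration, and a real job board will need its own selectors.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

COLUMNS = ["Title", "Link", "Content", "Date Posted"]  # only 'Content' is named in this README


def fetch_html(url: str) -> str:
    """Fetch a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_jobs(html: str) -> pd.DataFrame:
    """Extract one row per posting; the selectors are hypothetical."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.job"):
        link = card.select_one("a.title")
        rows.append({
            "Title": link.get_text(strip=True),
            "Link": link["href"],
            # strip=True trims each text piece; " " rejoins them cleanly.
            "Content": card.select_one("p.description").get_text(" ", strip=True),
            "Date Posted": card.select_one("span.date").get_text(strip=True),
        })
    return pd.DataFrame(rows, columns=COLUMNS)


# Made-up markup standing in for a fetched page, so the sketch runs offline.
SAMPLE_HTML = """
<div class="job">
  <a class="title" href="/jobs/1">Data Analyst</a>
  <p class="description">Build <b>dashboards</b> and reports.</p>
  <span class="date">2024-01-15</span>
</div>
<div class="job">
  <a class="title" href="/jobs/2">ML Engineer</a>
  <p class="description">Train and deploy models.</p>
  <span class="date">2024-01-16</span>
</div>
"""

jobs = parse_jobs(SAMPLE_HTML)
# With no path, to_csv returns the CSV text; pass a filename to write a file.
csv_text = jobs.to_csv(index=False)
print(csv_text)
```

In practice, the string returned by fetch_html(url) for the job board's listing page would replace SAMPLE_HTML, and jobs.to_csv would be given a file path.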

Folders and files

.env.example
.gitignore
README.md
Supplementary A - web_scraping_to_retrieve_unlabeled_data.ipynb
Supplementary B - self_training_to_label.ipynb
Supplementary C - recategorizing_of_labeled_data.ipynb
job_category_prediction_model.ipynb

Repository: cyborgEneki/job-board-ml-project
