A clean and well-structured data analysis project exploring IMDB Top 250 Movies data using Python, Pandas, NumPy, Seaborn, and Matplotlib.
This notebook reveals insights into movie ratings, genres, budgets, box office collections, runtimes, and directors — all presented through clear, engaging visualizations.
This project aims to analyze patterns and relationships in IMDB movie data.
Through data cleaning, transformation, and visualization, it helps answer questions such as:
- Which movies have the highest ratings?
- Which genres and directors are most popular?
- How do budget and box office earnings relate?
- Is there a correlation between runtime and rating?
- What are the most common certificate categories?
Each step is written to be beginner-friendly and reproducible in Google Colab or Jupyter Notebook.
- Introduction – Overview of goals and dataset.
- Imports & Display Settings – Load libraries and set visual themes.
- Load & Preview Data – Import and inspect the dataset.
- Helper Cleaning Functions – Convert runtime and money formats into numeric types.
- Apply Cleaning & Normalize Columns – Standardize data types and extract key features.
- Drop Incomplete Records – Remove rows with missing essential data.
- Analysis 1: Top 10 Movies by Rating (Horizontal Bar)
- Analysis 2: Distribution of Ratings (Histogram + KDE)
- Analysis 3: Top Years with Most Movies (Vertical Bar)
- Analysis 4: Rating Distribution by Genre (Boxplot)
- Analysis 5: Budget vs Box Office (Scatter with Log Scale)
- Analysis 6: Runtime vs Rating (Scatter + Regression)
- Analysis 7: Top Directors by Number of Movies (Horizontal Bar)
- Analysis 8: Correlation Heatmap (Numeric Features)
- Analysis 9: Certificate Distribution (Pie Chart)
- Conclusions – Compact interpretation.
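The normalization and drop steps in this outline could look roughly like the sketch below. It is not the notebook's actual code: the sample rows are made up, `primary_genre` is a hypothetical column name, and treating `name` and `rating` as the essential columns is an assumption.

```python
import pandas as pd

# Tiny stand-in for movies.csv; the values are made up for illustration.
df = pd.DataFrame({
    "name": ["Movie A", "Movie B", None],
    "year": ["1994", "2008", "1972"],
    "rating": ["9.3", "9.0", "9.2"],
    "genre": ["Drama", "Action,Crime,Drama", "Crime,Drama"],
})

# Standardize data types; unparseable values become NaN instead of raising.
df["year"] = pd.to_numeric(df["year"], errors="coerce")
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Extract a key feature: the first listed genre (hypothetical column name).
df["primary_genre"] = df["genre"].str.split(",").str[0].str.strip()

# Drop rows missing essential data. Which columns count as essential
# is an assumption here; adjust the subset to match your analysis.
clean = df.dropna(subset=["name", "rating"]).reset_index(drop=True)
```

Using `errors="coerce"` keeps the cleaning step from failing on a single malformed cell, at the cost of silently producing NaN values that the later `dropna` call may remove.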
- Comprehensive data cleaning and transformation.
- Visual exploration with 9 key analyses.
- Helper functions for parsing monetary and runtime values.
- Insights on relationships between budget, revenue, and ratings.
- Ready-to-run on Google Colab.
The notebook expects a CSV file named movies.csv with the following columns:
| Column Name | Description |
|---|---|
| rank | IMDB ranking number |
| name | Movie title (required) |
| year | Release year |
| rating | IMDB rating |
| genre | Comma-separated genres |
| certificate | Film classification (e.g., PG, R, G) |
| run_time | e.g., 2h 22min or 142 min |
| tagline | Short tagline (if available) |
| budget | e.g., $12M, €300K |
| box_office | e.g., $100M, £500K |
| casts | Main cast members |
| directors | Comma-separated names |
| writers | Comma-separated names |
📝 If your CSV filename or path differs, update it in the pd.read_csv() function.
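The `run_time`, `budget`, and `box_office` formats listed in the table could be parsed with helpers along these lines. This is a minimal sketch rather than the notebook's own implementation: the function names `parse_runtime` and `parse_money` are hypothetical, the K/M/B multipliers are assumed from the sample values, and currency symbols are ignored without any conversion.

```python
import re
from typing import Optional

# Assumed multipliers for the K/M/B suffixes seen in the sample values.
_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_runtime(text) -> Optional[int]:
    """Convert '2h 22min' or '142 min' to total minutes; None if unparseable."""
    if not isinstance(text, str):
        return None
    lowered = text.lower()
    hours = re.search(r"(\d+)\s*h", lowered)
    minutes = re.search(r"(\d+)\s*m", lowered)
    if not hours and not minutes:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


def parse_money(text) -> Optional[float]:
    """Convert '$12M' or '€300K' to a plain number; None if unparseable.

    The currency symbol is dropped, so amounts in different currencies
    are NOT converted to a common one.
    """
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d+(?:\.\d+)?)\s*([KMB])?", text.replace(",", ""), re.IGNORECASE)
    if not match:
        return None
    value = float(match.group(1))
    suffix = (match.group(2) or "").upper()
    return value * _SUFFIX.get(suffix, 1.0)
```

With pandas, such helpers would typically be applied column-wise, for example `df["run_time"].apply(parse_runtime)`. Because budgets appear in several currencies, any comparison of the parsed numbers across currencies should be treated with caution.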
- Open the notebook
- Upload your movies.csv file to Colab’s /content/ directory.
- Install required libraries:
pip install pandas numpy matplotlib seaborn
- Run all cells sequentially.
- View the output charts and insights directly in Colab.
- Top 10 Movies by Rating
- Rating Distribution (Histogram + KDE)
- Top Years for Movie Releases
- Genre-wise Rating Boxplot
- Budget vs Box Office (Scatter + Regression)
- Runtime vs Rating (Regression Plot)
- Top Directors by Number of Movies
- Numeric Correlation Heatmap
- Certificate Distribution Pie Chart
Each visualization provides valuable insights into movie data patterns.
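As an illustration of one chart from this list, a budget versus box office scatter on log axes could be drawn as sketched below. The column names `budget_num` and `box_office_num` are hypothetical names for already-cleaned numeric columns, and the sample rows are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_budget_vs_box_office(df: pd.DataFrame):
    """Scatter budget against box office on log-log axes and return the Axes."""
    # Log scales need strictly positive values, so filter out the rest first.
    data = df[(df["budget_num"] > 0) & (df["box_office_num"] > 0)]
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.scatter(data["budget_num"], data["box_office_num"], alpha=0.7)
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Budget")
    ax.set_ylabel("Box office")
    ax.set_title("Budget vs Box Office (log scale)")
    return ax


# Made-up sample rows; the zero-budget row is dropped by the filter.
sample = pd.DataFrame({
    "budget_num": [2.5e7, 1.85e8, 0.0],
    "box_office_num": [2.8e7, 1.0e9, 5.0e6],
})
ax = plot_budget_vs_box_office(sample)
```

Log axes are a common choice here because budgets and earnings span several orders of magnitude, and a linear axis would crowd most points into one corner.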
- Most highly rated movies come from Drama and Action genres.
- Budgets and box office earnings show a moderate positive correlation.
- Longer movies tend to have slightly higher ratings.
- Certain directors dominate the top movie list.
- The majority of movies fall into a few key certificate categories.
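The correlation statements above can be checked on your own copy of the data with pandas. The sketch below uses made-up numbers and a hypothetical `runtime_min` column, so its output says nothing about the real dataset.

```python
import pandas as pd

# Made-up values purely to demonstrate the call; not taken from movies.csv.
sample = pd.DataFrame({
    "rating": [8.1, 8.5, 9.0, 8.3],
    "runtime_min": [100, 130, 175, 120],
})

# DataFrame.corr() computes pairwise Pearson correlation by default.
corr = sample[["rating", "runtime_min"]].corr()

# Single coefficient for runtime vs rating, in the range [-1, 1].
r = corr.loc["runtime_min", "rating"]
print(f"runtime vs rating: r = {r:.2f}")
```

The same `corr` matrix is what a heatmap of numeric features would visualize. A coefficient describes only linear association in this sample and does not imply that longer runtimes cause higher ratings.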
This project is open source and available under the MIT License.
Ali Husnain Shah