Analysing the Space Race: A Data-Driven Look at 60+ Years of Launches
Project Overview
This project explores the evolution of space missions from 1957 onwards using the NextSpaceFlight mission dataset. Working in a Google Colab notebook as part of Angela Yu’s 100 Days of Code: Python (Day 99), I treated the exercise as a small, end-to-end data science case study: ingesting and cleaning real-world data, performing exploratory analysis, and building a concise set of visualisations that tell a clear story about launch activity, costs, and mission outcomes over time.
Core Technologies
- Python for data manipulation and analysis.
- pandas for tabular data handling, feature engineering, and grouping.
- matplotlib and plotly for static and interactive visualisations.
- Google Colab for a reproducible, cloud-based notebook workflow.
- iso3166 for mapping country names to ISO Alpha-3 codes for geospatial charts.
Data Cleaning & Preparation
The original missions dataset contained 4,324 rows and several artifacts from earlier exports. Before any analysis, I focused on making the data trustworthy and analysis-ready:
- Removed junk index columns (e.g. duplicate index fields created by CSV exports).
- Deduplicated rows after dropping those junk columns to ensure each mission was unique.
- Parsed dates from strings into proper `datetime` objects and derived `Year` and `Month` features.
- Cleaned launch prices by stripping currency formatting, converting to numeric (USD millions), and explicitly handling missing values.
- Standardised country information by extracting the country from location strings, correcting ambiguous locations (e.g. mapping Yellow Sea to China, Gran Canaria to USA), and converting names to ISO Alpha-3 codes.
This upfront work made downstream analysis much simpler and reduced the risk of misleading figures or incorrect aggregations.
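The cleaning steps above can be sketched in a few lines of pandas. The toy frame and column names below (`"Unnamed: 0"`, `"Date"`, `"Price"`) are illustrative stand-ins for the kind of export artifacts described, not the dataset's exact schema:

```python
import pandas as pd

# Toy data mimicking a CSV export: a junk index column, string dates,
# and prices in USD millions stored as formatted strings.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],  # junk index left behind by an earlier export
    "Date": ["Fri Oct 04, 1957", "Fri Oct 04, 1957", "Wed Mar 08, 2017"],
    "Price": ["5,000.0", "5,000.0", "29.75"],
})

# Drop the junk column first, then deduplicate so each mission is unique.
df = raw.drop(columns=["Unnamed: 0"]).drop_duplicates()

# Parse dates and derive Year/Month features for later grouping.
df["Date"] = pd.to_datetime(df["Date"], format="%a %b %d, %Y")
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month

# Strip thousands separators and coerce prices to numeric; unparseable
# values become NaN so gaps stay visible instead of silently vanishing.
df["Price"] = pd.to_numeric(df["Price"].str.replace(",", ""), errors="coerce")
```

Note the ordering: deduplication only works as intended after the junk index column is gone, because a unique per-row index makes otherwise identical rows look distinct.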
Key Analyses & Visualisations
Launch Activity by Organisation and Country
- Launches per organisation: bar charts highlighted which agencies and companies have flown the most missions over the full timeline.
- Choropleth maps: plotly choropleths (using ISO Alpha-3 codes) showed the geographic distribution of launches and failures by country, making it easy to spot historic and emerging spacefaring nations.
- Sunburst chart: a hierarchical plot of Country → Organisation → Mission Status revealed how mission outcomes vary across players and regions.
Economics of Launches
- Price distribution: histograms of launch prices (with sensible upper cutoffs to remove extreme tails) showed the typical cost ranges for missions.
- Price over time: time series charts of average launch price by year revealed how costs have evolved, with clear gaps where historical price data is missing.
- Spend by organisation: aggregations of total and average spend per organisation helped highlight who invests most heavily and how their per-launch costs compare.
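The spend aggregations above are a single `groupby` once prices are numeric. A minimal sketch on toy data, assuming cleaned `"Organisation"` and `"Price"` (USD millions) columns:

```python
import pandas as pd

# Toy cleaned data; None represents a mission with no published price.
df = pd.DataFrame({
    "Organisation": ["SpaceX", "SpaceX", "NASA", "NASA", "NASA"],
    "Price": [50.0, 62.0, 450.0, None, 160.0],
})

# Total and average spend per organisation. NaN prices are excluded
# from both aggregates rather than being treated as zero launches cost.
spend = (
    df.groupby("Organisation")["Price"]
      .agg(total_spend="sum", avg_price="mean")
      .sort_values("total_spend", ascending=False)
)
```

Keeping missing prices as NaN matters here: coercing them to zero would deflate average launch costs and understate total spend comparisons.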
Temporal Patterns & Seasonality
- Launches per year: annual counts exposed the overall growth of launch activity and key eras of acceleration.
- Month-on-month launches: a monthly time series with a rolling average smoothed short-term volatility and made longer-term trends more visible.
- Launches by month-of-year: aggregating across years showed which calendar months are historically most and least popular for launches.
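The month-on-month series with smoothing can be sketched with `resample` and `rolling`; the dates and the 3-month window below are illustrative:

```python
import pandas as pd

# Toy launch dates (illustrative only).
dates = pd.to_datetime(
    ["2018-01-05", "2018-01-20", "2018-02-11", "2018-03-02",
     "2018-03-15", "2018-03-30", "2018-04-09"]
)
df = pd.DataFrame({"Date": dates})

# Count launches per calendar month ("MS" = month-start frequency),
# then smooth with a rolling average to damp short-term volatility.
monthly = df.set_index("Date").resample("MS").size()
smoothed = monthly.rolling(window=3).mean()
```

The first `window - 1` points of the smoothed series are NaN by design, which keeps the chart honest at its left edge instead of implying a trend from incomplete windows.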
Reliability & Mission Outcomes
- Mission status distribution: quick distributions of success vs failure provided a baseline view of reliability.
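The baseline reliability view is a one-liner with `value_counts`; a sketch over a hypothetical `Mission_Status` column:

```python
import pandas as pd

# Toy outcome labels (illustrative only).
status = pd.Series(
    ["Success", "Success", "Success", "Failure", "Partial Failure"],
    name="Mission_Status",
)

# normalize=True turns raw counts into proportions, giving the
# success/failure share directly rather than absolute totals.
outcome_share = status.value_counts(normalize=True)
```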
Cold War & Modern Competition
- USA vs USSR (Cold War era): filtered time series up to 1991 compared launches by the USA and the Soviet bloc (including launches from Kazakhstan and other relevant locations), revealing their relative activity over the Cold War period.
- Modern leaders: more recent charts looked at which countries and organisations lead in total launches and successful launches, including the rise of new players and private launch providers in 2018–2020.
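The Cold War comparison boils down to labelling each launch with a superpower and counting per year up to 1991. A minimal sketch; the toy rows and the USSR grouping (folding Kazakhstan's launch sites into the Soviet bloc) are illustrative assumptions about the cleaned data:

```python
import pandas as pd

# Toy launch records (illustrative only).
df = pd.DataFrame({
    "Country": ["USA", "Kazakhstan", "Russia", "USA", "USA"],
    "Year": [1962, 1961, 1970, 1995, 1969],
})

# Map launch countries to superpower labels; Soviet-era launches from
# Kazakhstan (Baikonur) are grouped under the USSR.
ussr_states = {"Russia", "Kazakhstan"}
df["Superpower"] = df["Country"].map(
    lambda c: "USSR" if c in ussr_states
    else ("USA" if c == "USA" else "Other")
)

# Restrict to the Cold War era and count launches per superpower-year.
cold_war = df[df["Year"] <= 1991]
per_year = cold_war.groupby(["Superpower", "Year"]).size()
```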
Technical Highlights & Best Practices
- Question-driven EDA: each block of analysis was framed around a concrete question (e.g. “How have failure rates changed over time?”), which kept the notebook focused and readable.
- Explicit handling of missing and noisy data: rather than silently dropping rows, I made conscious decisions about when to filter, when to impute, and when to simply call out data gaps in the narrative.
- Feature engineering for clarity: deriving `Year`, `Month`, country codes, and superpower labels (e.g. grouping USA vs USSR) made plots and tables more interpretable.
- Clear, publication-ready visuals: charts are labelled with descriptive titles, axes, and legends, and use colour consistently to distinguish organisations and countries.
- Reproducible notebook workflow: the analysis runs end-to-end in a single Colab notebook, making it easy to rerun on updated datasets or adapt to related questions.
Skills Demonstrated
- End-to-end data science workflow: from raw CSV to cleaned dataset, EDA, visual storytelling, and conclusions.
- Proficient use of the Python data stack (pandas, matplotlib, plotly) in a notebook environment.
- Data cleaning and feature engineering with real-world, imperfect data.
- Exploratory visualisation and selection of appropriate chart types for temporal, categorical, and geospatial questions.
- Analytical thinking about domain questions (e.g. Cold War competition, reliability trends, seasonal patterns) backed by quantitative evidence.
Closing Thoughts
While this project was framed as a course assignment, I approached it as a practical demonstration of how I work with real datasets: start with clean, reliable data; ask clear questions; build targeted visualisations; and connect the results back to meaningful narratives. The same workflow scales to more complex data science problems, dashboards, and production analytics work.