Data Science: The 10 Commandments for Performing a Data Science Project

Hello!

Machine learning ultimately aims to build a model that performs well on unseen data. Selecting the most accurate model requires careful comparison and validation on a holdout set distinct from the one used for hyperparameter tuning. Appropriate statistical tests are also essential for reliable evaluation.

These principles will guide your next data science project in 2026. Let me know in the comments if you found them helpful—you can even share your own commandments!

Understanding stakeholder goals is vital in any data science initiative, yet it alone does not guarantee success. Data science teams must follow proven best practices to deliver on a clearly defined brief. The ten points below outline what this entails in practice.

1. Understanding the Problem

Clearly defining the problem you aim to solve is the foundation of any successful project. This means grasping the prediction target, all constraints, and the desired business or user outcome.

Ask clarifying questions early and validate your understanding with domain experts, colleagues, and end users. When their feedback aligns with your interpretation, you are on the right track.

2. Know Your Data

Deep familiarity with your data helps determine which models and features will be most effective. The nature of the data directly influences model choice and computational cost, which in turn affects project timelines and budgets.

Creating meaningful features allows models to better emulate human decision-making. Understanding each field is especially critical in regulated industries, where data is often anonymized. When in doubt, consult a domain expert.

3. Split Your Data

A model’s true value lies in its ability to generalize to new, unseen data. To assess this capability, keep a portion of the data completely hidden during training.

For supervised learning, split the dataset into a training set (typically 75–80% of the original data, selected randomly) and a test set. If you tune hyperparameters or compare several models, carve out a separate validation set for that purpose, so that the test set stays untouched until the final evaluation.

Use scikit-learn’s train_test_split function in Python to perform this split reliably.
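As a minimal sketch of such a split, the example below calls train_test_split twice to obtain training, validation, and test sets. The synthetic data, the 60/20/20 proportions, and the variable names are assumptions made for illustration; adapt them to your own dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration: 1,000 rows, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve off the test set (20% of the data), stratified by class.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Then split the remainder into training and validation sets.
# 25% of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible, and stratify keeps the class proportions similar across the three sets.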

4. Don’t Leak Test Data

Never allow any information from the test set to influence model training. This includes seemingly minor steps such as scaling or normalizing features before splitting the data.

Premature normalization, for example, can inadvertently expose the model to the global minimum and maximum values present in the held-out test data, leading to overly optimistic performance estimates.
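One common way to avoid this kind of leak is to split first and wrap the preprocessing and the model in a scikit-learn pipeline, so the scaler only ever sees training rows when it is fitted. The sketch below assumes synthetic data and a logistic regression model, both chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Split first, so the test rows play no part in any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# fit() learns the minima and maxima from the training data only;
# score() reuses those training statistics to transform the test data.
model = make_pipeline(MinMaxScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["minmaxscaler"]
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler inside each training fold.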

5. Use the Right Evaluation Metrics

Every problem demands an evaluation approach tailored to its specific context. Accuracy alone can be misleading. Consider a cancer-detection model evaluated on data in which only 1% of patients have cancer: by always predicting “not cancer” it achieves 99% accuracy while failing to identify a single actual case.

Choose metrics that truly reflect the business or clinical objective, whether for classification tasks (for example precision, recall, or F1 score) or regression tasks (for example mean absolute error or root mean squared error).
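The imbalanced-data pitfall can be reproduced in a few lines. The labels below are made up for illustration (10 positive cases out of 1,000), and the "model" is just an array of constant negative predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 10 positive cases among 1,000 patients.
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of positives that were found

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints "accuracy = 0.99, recall = 0.00"
```

Recall exposes what accuracy hides here: none of the positive cases is detected.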

6. Keep It Simple

Resist the temptation to adopt the most complex model available. Follow Occam’s Razor: select the simplest model that adequately meets performance requirements. This approach reduces training time, improves interpretability, and often yields better generalization.

7. Avoid Overfitting and Underfitting

Overfitting (high variance) occurs when a model memorizes training data and fails on new examples. Underfitting (high bias) arises when the model is too simplistic to capture underlying patterns. Balancing this bias-variance trade-off is essential for each specific problem.
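A simple way to spot either problem is to compare training and test scores as model complexity grows. The sketch below does this for decision trees of increasing depth; the synthetic dataset and the chosen depths are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_depth=1 is a very simple model; max_depth=None lets the tree grow
# until it fits the training data as closely as it can.
results = {}
for depth in (1, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

Low scores on both sets suggest underfitting, while a large gap between a near-perfect training score and a weaker test score suggests overfitting.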

8. Try Different Model Architectures

Experiment with a range of architectures, both simple and complex. For classification tasks, compare random forests with neural networks; interestingly, gradient-boosting methods frequently outperform neural networks on tabular data.
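One possible way to run such a comparison is to score several candidates with the same cross-validation setup. The three models and the synthetic dataset below are assumptions chosen for illustration; a neural network such as scikit-learn's MLPClassifier could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic tabular data for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Candidate models, from simple to more complex.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with the same 5-fold cross-validation.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Which model comes out ahead depends on the data, so the ranking on this toy dataset says nothing about your own problem.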

9. Tune Your Hyperparameters

Hyperparameters control the learning process (for example, the maximum depth of a decision tree). Default values are chosen for average performance across many datasets, yet problem-specific tuning often yields substantial gains. Techniques such as grid search, randomized search, and Bayesian optimization can help identify optimal settings.
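As an illustration of grid search, the sketch below tunes a decision tree with scikit-learn's GridSearchCV. The parameter grid, the synthetic data, and the variable names are assumptions; in a real project the grid should reflect the model and the compute budget.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid: every combination is evaluated with 5-fold
# cross-validation on the training data only.
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")

# The test set is touched once, after tuning is finished.
test_accuracy = search.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

RandomizedSearchCV follows the same pattern and is often cheaper when the grid is large.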

10. Comparing Models Correctly

Machine learning succeeds when models generalize well. To choose the best model, evaluate it on a holdout set that was never used during hyperparameter tuning, and apply appropriate statistical tests to confirm performance differences.
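One simple option among several is a paired t-test on scores obtained from identical cross-validation folds, sketched below with scipy.stats.ttest_rel. The two models and the synthetic data are assumptions for illustration. Because fold scores are not fully independent, treat the resulting p-value as a rough guide rather than an exact figure.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration; in practice use data that played
# no part in hyperparameter tuning.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# A fixed random_state gives both models exactly the same folds,
# which is what makes the scores paired.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test on the per-fold accuracy differences.
result = ttest_rel(scores_a, scores_b)
p_value = float(result.pvalue)

print(f"model A: {scores_a.mean():.3f}, model B: {scores_b.mean():.3f}")
print(f"paired t-test p-value: {p_value:.4f}")
```

A small p-value suggests the difference in mean accuracy is unlikely to be due to the particular folds alone.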

Thank you!
See you!