• 03.11.2022 09:30

    Data Science: The 10 Commandments for Performing a Data Science Project

    These ten principles can guide your next data science project. Let me know whether you found them helpful, and feel free to add your own commandments in the comments!

    Understanding the goals of the users or participants in a data science project is crucial, but it does not guarantee success. To deliver on a clearly defined brief, data science teams must also follow best practices when executing the project. The ten points below summarize what that means in practice.

    1. Understanding the Problem

    Understanding the problem you are trying to solve is the most important part of solving it. You must know what you are trying to predict, every constraint that applies, and the end goal of the project.

    Ask questions early and validate your understanding with domain experts, peers, and end-users. If the answers you receive align with your understanding, then you are on the right track.

    2. Know Your Data

    Knowing what your data means will help you understand which models are likely to be effective and which features to use. The nature of the data determines which model will be most successful, and the compute time a model requires affects the project’s cost.

    By creating and using meaningful features, you can mimic or improve on human decision-making. It is crucial to understand the meaning of each field, especially in regulated industries, where data may be anonymized and its meaning unclear. If you’re unsure what something means, consult a domain expert.

    3. Split your data

    How will your model perform on unseen data? If your model can’t generalize to new data, it doesn’t matter how well it does on the data it was given.

    You can estimate performance on unseen data by holding some of the data back so that the model never sees it during training. This is essential for choosing the right model architecture and tuning parameters.

    Supervised learning requires splitting your data into multiple parts. The training data is the data the model learns from. It typically makes up 75-80% of the original data and is usually chosen at random.

    The remaining data is the testing data, which is used to evaluate your model. Depending on the type of model you are creating, you may also need a third set, called the validation set.

    In the usual convention, the validation set is used to tune a model and to compare different iterations or candidate models, while the test set is kept aside for the final evaluation.

    In that case you will need to split the non-training data into validation and test sets: compare iterations of a model on the validation data, and evaluate the final versions on the test data.

    In Python, scikit-learn’s train_test_split function is a convenient and widely used way to split data correctly.
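
    As a minimal sketch of a three-way split, train_test_split can be called twice: once to hold out a test set and once more to carve a validation set out of the remainder. The toy data, the 80/10/10 proportions, and the random_state value below are assumptions chosen for illustration, not something the article prescribes.

```python
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples with alternating labels.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# Step 1: hold out 10 samples as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=10, random_state=42, stratify=y
)

# Step 2: hold out another 10 samples from the remainder for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=10, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 80 10 10
```

    Passing stratify keeps the class proportions similar in every split, which matters most when one class is rare.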

    4. Don’t Leak Test Data

    It is important not to let any information from the test data reach your model during training. Leakage can be as blatant as training on the entire data set, or as subtle as performing transformations (such as scaling) before splitting.

    If you normalize your data before splitting, the model gains information about the test set, because the global minimum or maximum might lie in the held-out data.
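
    The scaling case can be sketched in plain Python. The helper names fit_min_max and transform, and the numbers, are hypothetical and exist only for this illustration. The point is that the scaling statistics are computed from the training split alone and then reused on the test split.

```python
def fit_min_max(values):
    """Learn the scaling statistics (minimum and maximum) from the given values."""
    return min(values), max(values)


def transform(values, lo, hi):
    """Apply min-max scaling with previously learned statistics."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values; 12.0 in the test split exceeds the training range.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 12.0]

# Leaky: the maximum is learned from train and test together.
lo_all, hi_all = fit_min_max(train + test)

# Correct: statistics come from the training split only.
lo, hi = fit_min_max(train)
train_scaled = transform(train, lo, hi)
test_scaled = transform(test, lo, hi)

print(hi_all, hi)      # 12.0 8.0
print(test_scaled)     # the second value is above 1.0, which is expected
```

    A scaled test value outside the range 0 to 1 is not a bug here; it simply shows that the test set contained something the training data did not. With scikit-learn, the same discipline means fitting a scaler such as MinMaxScaler on the training split and only transforming the other splits.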

    5. Use the Right Evaluation Metrics

    Every problem is unique, so the evaluation metric must be chosen for its context. Accuracy is the most naive and most dangerous classification metric. Take the example of cancer detection.

    If all we want is an accurate model, we can simply always predict “not cancer”. If only about 1 percent of cases are positive, we will be correct 99 percent of the time.

    This is clearly not the best model, since the whole point is to detect the cancer. Be careful when deciding which evaluation metric to use for your classification and regression problems.
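
    The arithmetic behind the cancer example can be checked in a few lines. The 1,000 patients and the 1 percent positive rate are assumed numbers for illustration. Recall (the share of actual positives that the model finds) is shown as one alternative metric; the article itself does not name a specific replacement for accuracy.

```python
# Hypothetical population: 1,000 patients, 10 of whom (1%) have cancer.
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts "not cancer".
y_pred = [0] * 1000

# Accuracy: share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual cancer cases that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

    The model looks excellent by accuracy and useless by recall, which is why the metric has to follow from what the project is actually trying to achieve.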

    6. Keep it simple

    It is important to select the best solution for your problem, not the most complex one. Management, customers, and even you might want to use the “latest and greatest.” Instead, use the simplest model that meets your needs, a principle known as Occam’s Razor.

    This not only makes the model easier to interpret and faster to train, but can also improve performance. You shouldn’t shoot a fly with a bazooka or try to kill Godzilla with a flyswatter.

    7. Do not overfit or underfit your model

    Overfitting, which is associated with high variance, leads to poor performance on data the model has not seen. The model has simply memorized the training data.

    Underfitting, which is associated with high bias, happens when the model captures too little detail to represent the problem accurately. Balancing the two is known as the “bias-variance trade-off”, and each problem requires a different balance.

    Let’s use a simple image classifier as an example. Its job is to identify whether a dog is present in an image.

    If you overfit this model, it will not recognize an image as a dog unless it has seen that exact image before. If you underfit it, it might not recognize an image as a dog even if it has seen that particular image before.
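
    The same contrast can be shown with a deliberately tiny toy task in place of images. Everything below is hypothetical: the rule "label is 1 when x >= 10", the three hand-written models, and the even/odd split. It is a caricature of the two failure modes, not a realistic training procedure.

```python
def true_label(x):
    """Ground truth for the toy task: 1 when x is at least 10, else 0."""
    return 1 if x >= 10 else 0


train_x = list(range(0, 20, 2))  # even numbers: seen during training
test_x = list(range(1, 20, 2))   # odd numbers: never seen during training

# Overfit: memorizes every training pair and guesses 0 for anything unseen.
memory = {x: true_label(x) for x in train_x}

def overfit_model(x):
    return memory.get(x, 0)

# Underfit: too simple to use the input at all, always predicts 0.
def underfit_model(x):
    return 0

# Balanced: learns one threshold from the training data.
threshold = min(x for x in train_x if true_label(x) == 1)

def balanced_model(x):
    return 1 if x >= threshold else 0


def accuracy(model, xs):
    """Share of inputs for which the model matches the ground truth."""
    return sum(model(x) == true_label(x) for x in xs) / len(xs)


for name, model in [("overfit", overfit_model),
                    ("underfit", underfit_model),
                    ("balanced", balanced_model)]:
    print(name, accuracy(model, train_x), accuracy(model, test_x))
```

    The memorizing model is perfect on the training data and no better than the trivial model on new data, while the trivial model is poor on both. A large gap between training and test performance points to overfitting; poor performance on both points to underfitting.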

    8. Try Different Model Architectures

    It is often beneficial to try several different models on a particular problem. An architecture that works well for one problem may not work well for another.

    Mix simple and complex algorithms. If you are creating a classification model, for example, try something as simple as a random forest and something as complex as a neural network.

    Interestingly, extreme gradient boosting (XGBoost) often outperforms a neural network classifier. Simple problems are often best solved with simple models.
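
    One way to try several architectures on equal terms is to score each with the same cross-validation folds. The sketch below assumes scikit-learn and a synthetic data set from make_classification. It uses scikit-learn's own GradientBoostingClassifier as a stand-in for the extreme gradient boosting mentioned above and leaves out the neural network to keep the run short; the candidate list is an assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (assumed for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate models, from simple to more complex.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with 5-fold cross-validation and the same metric.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

    Which candidate comes out ahead depends on the data, so the printed ranking for this synthetic set says nothing about your own problem.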

    9. Tune Your Hyperparameters

    Hyperparameters are the values that control how a model is built and trained; they are set before training rather than learned from the data. One example of a hyperparameter in a decision tree is its maximum depth.

    This is how many questions the tree may ask before it decides on an answer. A model’s default hyperparameters are chosen to perform reasonably well on average across many problems.

    It is unlikely that the defaults hit the sweet spot for your particular problem, so your model may perform better with different values. Common methods for tuning hyperparameters include grid search, randomized search, and Bayesian optimization.
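
    A grid search over the tree-depth example can be sketched with scikit-learn's GridSearchCV. The synthetic data and the particular depth values in the grid are assumptions for illustration. The search only ever sees the training split, so the held-out test split stays untouched until the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration), split before any tuning happens.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Candidate depths; None lets the tree grow until its leaves are pure.
param_grid = {"max_depth": [1, 2, 3, 5, 8, None]}

# 5-fold cross-validation on the training split for every candidate depth.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
test_score = search.score(X_test, y_test)  # accuracy of the refit best model
print(round(test_score, 3))
```

    Randomized search follows the same pattern with RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all.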

    10. Comparing Models Correctly

    The ultimate goal of machine learning is to create a model that generalizes. It is therefore important to compare candidate models correctly before choosing the best one.

    You will need a holdout set different from the one you used to tune your hyperparameters. You will also need appropriate statistical tests to evaluate the results.
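
    The article does not name a particular test. As one possible choice for two classifiers scored on the same holdout set, the sketch below implements the exact (binomial) form of McNemar's test, which looks only at the samples where the two models disagree. The function name mcnemar_exact and the predictions are hypothetical and exist only for this illustration.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two classifiers on the same samples."""
    # Count the samples where exactly one of the two models is correct.
    a_only = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    b_only = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    # Under the null hypothesis each disagreement favors either model with p = 0.5.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical holdout of 20 samples, all with true label 1.
y_true = [1] * 20
pred_a = [1] * 8 + [0] * 2 + [1] * 10   # model A: 18 of 20 correct
pred_b = [0] * 8 + [1] * 2 + [1] * 10   # model B: 12 of 20 correct

p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(p_value)  # 0.109375
```

    Model A looks clearly better on raw accuracy, yet with only ten disagreements the difference is not significant at the usual 0.05 level. That is exactly the kind of conclusion a plain comparison of scores would miss.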
