Student Lounge
https://blogs.mathworks.com/student-lounge
The Student Lounge blog focuses on student success stories: winning student teams share their knowledge, and the MathWorks student programs team shares best practices and workflows using MATLAB and Simulink.

Building an Intrusion Detection System: A Triumph at the SANReN Cyber Security Challenge
https://blogs.mathworks.com/student-lounge/2024/03/18/building-an-intrusion-detection-system-a-triumph-at-the-sanren-cyber-security-challenge/ (March 18, 2024)



Meet the champions: Shani Nezar, Uhone Teffo, Carlo Barnardo, and Heinrich E. This team trained the most accurate machine learning model among all 10 teams at the SANReN Cyber Security Challenge! They exploited the easy-to-use capabilities of the MathWorks platform and trained a machine learning model via the MATLAB Classification Learner app for cyber threat detection. Their proficiency was significantly enhanced by complimentary courses like MATLAB Onramp and Machine Learning Onramp, which swiftly equipped them with up-to-date AI knowledge and earned them extra points in the competition. Let’s hear about their journey:
Inspiration
In the dynamic realm of cybersecurity, staying one step ahead of potential threats is paramount. The SANReN Cyber Security Challenge provided a platform for teams to showcase their prowess, and our journey through the challenge was marked by a standout achievement in the MATLAB Classification Challenge: a remarkable 98% accuracy score on a machine learning model designed for intrusion detection. The crux of our success lay in the utilization of the open dataset UNSW-NB15, a goldmine of real-time network traffic data with rich features specifically curated for anomaly-based intrusion detection. The dataset can be downloaded at the following link.
Breaking Down the Problem
The UNSW-NB15 dataset, with its meticulous labelling of attacks (1) and non-attacks (0), served as the foundation for our challenge. The primary goal was to leverage the features within the dataset to predict whether a given data point belongs to the attack or non-attack category. This, essentially, was the task at hand – developing a robust Intrusion Detection System (IDS) capable of discerning malicious activities from normal network behaviour.
How Did We Implement It?
Dataset Exploration
Before diving into the development of the machine learning model, we meticulously explored the UNSW-NB15 dataset. Understanding the intricacies of the features, the distribution of data, and the characteristics of attacks proved crucial in designing an effective solution.
Model Selection
Given the nature of the problem, we opted for a machine learning approach. Our model of choice was carefully selected based on its suitability for intrusion detection tasks. After thorough evaluation, we settled on a model that showcased promising results during initial experimentation.
Online Trainings with MATLAB Onramp and Machine Learning Onramp
Our journey to success was significantly enriched by the invaluable skills and insights gained through MATLAB Onramp and Machine Learning Onramp. These online trainings, provided by MATLAB Academy, equipped our team with essential knowledge, allowing us to navigate the intricacies of data exploration and model development seamlessly. The onramps acted as catalysts in our problem-solving journey, bridging the gap between theoretical understanding and practical application.
Low-code AI with MATLAB
MATLAB’s intuitive environment facilitated a smooth exploration of the dataset. With its user-friendly interface and powerful functionalities, we delved into the data, gaining insights that shaped our approach. MATLAB’s capabilities not only simplified the process but also enhanced our efficiency in handling complex data structures. A noteworthy aspect of our methodology was the utilization of low-code AI with MATLAB. Leveraging an App coupled with a concise 10 lines of code, we navigated what might have seemed like a daunting coding challenge. This approach not only streamlined our implementation but also highlighted the accessibility of AI, even for those not deeply versed in coding intricacies.
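For readers who want to try a similar low-code workflow, here is a minimal sketch in roughly that many lines. It assumes the UNSW-NB15 training split has been saved locally as UNSW_NB15_training-set.csv with the binary label stored in a column named label (both names are placeholders for whatever your download uses); a model trained interactively in the Classification Learner app can be exported to the workspace in the same spirit.

data = readtable("UNSW_NB15_training-set.csv"); % load the labelled network traffic records
data.label = categorical(data.label);           % treat attack (1) vs. non-attack (0) as classes
cv = cvpartition(data.label, "HoldOut", 0.2);   % hold out 20% of rows for validation
trainData = data(training(cv), :);
testData = data(test(cv), :);
mdl = fitcensemble(trainData, "label");         % fit an ensemble of decision trees
pred = predict(mdl, testData);                  % classify the held-out traffic
accuracy = mean(pred == testData.label)         % fraction of flows classified correctly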
Ready-to-Train Models in Classification Learner App
The Classification Learner App emerged as a game-changer, providing us with ready-to-train models that significantly expedited our development process. This feature allowed us to focus on the application of AI rather than its intricate development, especially when it came to choosing the correct model hyperparameters. The availability of pre-built models within the app played a pivotal role in achieving success without the need for extensive AI expertise.
Results
The culmination of our efforts resulted in an impressive 95.8% accuracy score. Our machine learning model successfully identified and classified attacks with remarkable precision, showcasing the potential of data-driven approaches in cybersecurity. The ability to predict malicious activities with such accuracy reflects not only the efficiency of our chosen model but also the robustness of our methodology.
Key Takeaways
1. Dataset Understanding is Key
Thoroughly understanding the dataset is foundational. MATLAB enables easy feature exploration, pattern identification, and comprehension of the nature of attacks. This ease of data exploration greatly influenced the success of the intrusion detection system.
2. Model Selection Matters
Selecting the optimal machine learning model for intrusion detection is crucial. MATLAB apps offer a variety of pre-built models, enabling users to concentrate on achieving the precision required to detect nuanced irregularities in network traffic, which directly influences the system’s efficiency.
3. Real-world Simulation
The inclusion of fresh, unlabelled data for prediction mirrors the challenges faced in real-world cybersecurity. A model’s ability to adapt and identify novel threats is a testament to its practicality.
4. Continuous Improvement
The landscape of cybersecurity is ever-evolving. Regular updates to the model and continuous monitoring ensure that the IDS remains effective in identifying new and emerging threats.
In conclusion, our success at the SANReN Cyber Security Challenge stands as a testament to the power of machine learning in bolstering cybersecurity defences. The journey from dataset exploration to model deployment underscored the importance of meticulous planning, adaptability, and a deep understanding of the intricacies of network traffic. As we celebrate our triumph, we also acknowledge the ongoing commitment required to stay at the forefront of cybersecurity innovation. The path to a secure digital landscape is paved with continuous learning, resilience, and a proactive approach to emerging threats.

Where are they now? – Amine Taoudi, NXP
https://blogs.mathworks.com/student-lounge/2024/03/04/where-are-they-now-amine-taoudi-nxp/ (March 4, 2024)


Today, we’re talking to Amine Taoudi: a Vehicle Controls and Networking Solutions Applications Engineer at NXP Semiconductors. As a student at Mississippi State University, Amine participated in a past Advanced Vehicle Technology Competition (AVTC) called the EcoCAR Mobility Challenge as his team’s propulsion controls and modeling lead. During this time, Amine won the MathWorks Model-Based Design award two years in a row!

What Did You Learn in the Competition?

Why did you choose to get involved in the competition?

I recognized that the AVTC program is a unique opportunity to gain hands-on experience in designing and developing electrified vehicle architectures. The scope of the EcoCAR Mobility Challenge allowed me to learn more about vehicle connectivity and automation technology through workshops held by industry experts, and it provided me with a platform to research how these technologies can be used to improve the energy efficiency of electrified powertrains. Additionally, the mentorship programs and multiple events sponsored by the competition sponsors such as MathWorks allowed me to build my network and get exposure to the key automotive players in North America.

What was your role on the team?

I served as my team’s Propulsion Controls and Modeling lead.

How did you use MATLAB/Simulink in the competition and/or academic work?

The main focus of the Propulsion Controls and Modeling team that I led was to develop high-fidelity models of our competition vehicle and to design, test, and validate propulsion controls software. MATLAB/Simulink provided my team with the perfect ecosystem to successfully design and integrate a safe and robust hybrid supervisory controller for our electrified vehicle. My team used products such as the Requirements Toolbox and Simulink Test to author and organize our software requirements and track our test coverage, verification, and validation. We also relied on optimization toolboxes to conduct design of experiments for our high-fidelity models and to tune our controllers. Ultimately, we relied on Simulink’s rapid prototyping capabilities to fast-track our software development process, which led to us winning the MathWorks-sponsored award for the best Model-Based Design approach two years in a row.

How Did the Competition Help You Find a Job?




Do you think the skills you gained using these tools have helped you in your professional career? If so, how?

Absolutely. I am currently working on large multidisciplinary projects with direct impact on automotive customers, and I find myself using the exact same tools and techniques I learned during the competition. A prime example is the rapid prototyping and embedded code generation tools provided by MathWorks that I use daily for software-in-the-loop and processor-in-the-loop simulations.

Did your hands-on experience in the competition help to prepare you for your first role in industry?

Without a doubt. It was thanks to the competition events that I learned about my current job and it was thanks to the experience I gained through working on the EcoCAR project that I gained the skills necessary to be considered for the position.

What Are You Working On Today?

In your current role at NXP Semiconductors, do you use MATLAB/Simulink?

As a vehicle controls and networking solutions applications engineer, I work on enabling NXP’s customers with new solutions for our automotive microcontrollers and microprocessors. A significant portion of the solutions and applications that I work on use Model-Based Design. As such, MATLAB/Simulink plays an integral part in my day-to-day activities.

What big project are you working on right now?

Currently, I am working on creating a comprehensive solution for Model Predictive Control (MPC) that is optimized for our automotive microcontrollers. This solution will be integrated into our NXP Model-Based Design Toolbox and will provide our customers with a path to design their MPC applications using MATLAB/Simulink and seamlessly deploy them, with guaranteed maximized computational efficiency, on our NXP hardware.

Key Takeaways

What advice would you give young engineers seeking employment post-grad?

For STEM students, you stand to gain a lot by getting involved in technical competitions like EcoCAR. They provide the perfect environment to gain technical and soft skills that are relevant to the industry. It is also important to build a good network, seek mentorship programs, and take advantage of any co-op or internship opportunities.

 


Virginia Tech AutoDrive Simulation Suite for Autonomous Vehicles
https://blogs.mathworks.com/student-lounge/2024/02/20/virginia-tech-autodrive-simulation-suite-for-autonomous-vehicles/ (February 20, 2024)


Introduction
The focus of this blog is to delve into Virginia Tech’s simulation team and show off how they leveraged MathWorks’ Simulink and MATLAB platforms to gain major insights into the development process for autonomous vehicle systems. While the team was able to use MathWorks tools in numerous ways, the simulation team leaned particularly heavily on the ability to dynamically manipulate virtual environments to replicate real driving scenarios. Below is a discussion of how the team was able to create, test, validate, and visualize the data from simulations to fuel the development of their software-driven vehicle.
Motivation
When developing software for the control of an autonomous vehicle, our goal is to develop, deploy, test, analyze, and then repeat the process to bring our car closer to full autonomy. This is no easy task. Our team has learned that development takes a lot of effort from all types of sub-teams. Historically, our team developed and tested software directly on the physical vehicle. Upon learning that MathWorks was challenging us to lean more into a simulation-focused approach, we jumped in feet first. Given our established background in MATLAB, the team aimed to learn more about Simulink and how it could serve as a new means of testing our software developments. With that in mind, we set out to create a simulation test bench that could allow us to quickly yet safely deploy our code to virtual vehicles and propel our development pace to new heights.
Methodology
Knowing that we had been challenged to create a simulation allowing us to perform regression-type testing, we knew that developing Simulink subsystems was required. For these reasons, the team has developed a simulation environment comprised of a vehicle dynamics module, a path planning module, a CAN communication module, a global commander module, and a vehicle controller module as seen in Figure 1 below. The focus of this article will be on the vehicle controller and path planning modules as these were among the most impactful software developed entirely in the simulation environment.
Figure 1: VT simulation test-bench including path planning, 3d-visualization, a user operated vehicle, and an experimentally validated vehicle dynamics emulator.
The team was tasked to create a simulation environment where we could create and then vary specific “scenarios”. A scenario in this case refers to a situation that our simulated autonomous vehicle must navigate. We decided to create a scenario where our autonomous vehicle was driving a given path but was disturbed by another vehicle driving into its path. This requires the vehicle to handle the situation in a few different ways. In some cases, the autonomous vehicle must stop, while in others the vehicle is able to change lanes to continue toward its original destination. Figure 2 below shows a few images illustrating the scenario setup.
Figure 2: Dynamic actor routing as seen from a chase camera angle and birds-eye views
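The blog does not spell out which scenario-authoring tool was used, but as an illustration, a cut-in scenario like the one in Figure 2 can be sketched with the drivingScenario API from Automated Driving Toolbox; the road geometry, waypoints, and speeds below are invented for the example.

% Minimal cut-in scenario: an actor merges into the ego vehicle's lane
scenario = drivingScenario;
road(scenario, [0 0; 200 0], 'Lanes', lanespec(2)); % 200 m straight two-lane road

egoVehicle = vehicle(scenario, 'ClassID', 1);
trajectory(egoVehicle, [5 -2; 195 -2], 15); % ego holds the right lane at 15 m/s

cutInVehicle = vehicle(scenario, 'ClassID', 1);
trajectory(cutInVehicle, [20 2; 60 2; 90 -2; 195 -2], 18); % merges ahead of the ego

plot(scenario)
while advance(scenario) % step the simulation until the trajectories finish
    pause(0.01)
end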
While we started by creating a scenario where all dynamic actors were controlled by predefined routes, our team eventually chose to develop a user interface which consisted of a game controller used by students to manually vary the scenarios. This served a few purposes. The first major benefit of this style of testing is that we can directly interact with the autonomous vehicle in real-time. The second major advantage is that it enables students to be more involved in the testing and analysis portions of simulation. Some of the user interface design can be seen in Figure 3 below.
Figure 3: The team’s user interface with force feedback capabilities allowing for more realistic feel when driving in the simulator.
We outlined a few requirements for the students: they must drive in as close to a legal manner as possible, they cannot hit the autonomous vehicle directly, and they must do their best to cause the autonomous vehicle to fail. With these simple rules in place, we allowed students to interact with the autonomous vehicle as much as they wanted. We recorded and analyzed data from these interactions and used the findings to fuel our development processes. The results of our testing will be outlined later. Overall, this method of allowing students to drive in the simulation allowed for more life-like scenarios. All of these manual scenarios were recorded so they could be played back in a completely automated test bench whenever we wanted to dive further into a scenario we found particularly interesting.
Figure 4: Students testing out and using the Simulation test bench at Virginia Tech’s O-Show event
Results and Validation
The team was able to gain meaningful information using human interactions in simulation. We captured the following three major types of data: vehicle controller data, imagery data, and regression testing parameter performance data. All three require different methods of visualizing and analyzing data. While many options exist, we settled on a custom real-time control data display, a video stream showing us what the vision systems can see along with any lane line data they produced, and finally a spider plot to compare the different metrics we deemed important for regression testing, respectively. Examples of these can be seen in Figure 5 below.
Figure 5: Three data display options used by the team to view real-time control data (top left), spider plots from regression testing (top right), and the vehicle chase camera (bottom).
The real-time control data display was used to monitor all control signals related to the autonomous vehicle throughout our testing. This display consisted of lateral errors, velocity errors, steering wheel angle inputs, and acceleration and braking inputs to name a few. Not only did this information prove useful in finding flaws in our simulation control, but it directly impacted the software developed for our real vehicle, making this analysis more valuable than any simulation testing ever done before by the VT AutoDrive team.
The ability to see what the vehicle “sees” also serves as a great way to discover shortcomings with our perception algorithms. The team was able to display and save the video feeds from the simulation which allowed us to completely redevelop our lane tracking algorithm to work far more efficiently than before. Simulink makes video processing and display far more user-friendly than any other platform we have used, which enables our ability to quickly iterate on software design and see the results in near real-time. While image processing is still being developed by our team, the tools provided to us by Simulink have propelled us forward at a much faster pace than ever before.
One of the most useful discoveries by the simulation team was spider plots as seen in Figure 6 below. These plots serve as a great method of displaying how well a given test case achieves multiple design criteria. It took the team a while to discover and use these plots, but the impact was felt immediately upon implementation. The ability to run regression testing and find the broad effects of changing one or more design variables proved very useful. We were able to find what variables are more strongly linked together, as well as determining if other variables are adequately independent. While this may not sound groundbreaking, the team was able to determine if some of our assumed control strategies were possible while simultaneously discovering which regions of operation our strategies worked best. Our team has found this data display technique so impactful, that nearly all development is now focusing on these types of data visualizations.
Figure 6: A typical display created by a set of regression tests demonstrating how multiple trials can be quickly compared using high impact parameters.
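Base MATLAB graphics does not ship a dedicated spider chart, so one hedged way to reproduce a comparison like Figure 6 is with polarplot, closing each metric loop manually; the metric names and trial values here are made up for illustration.

% Compare two regression-test trials across five normalized metrics (0 to 1)
metrics = ["Lateral error" "Velocity error" "Lane tracking" "Comfort" "Runtime"];
trialA  = [0.8 0.6 0.9 0.7 0.5];
trialB  = [0.6 0.8 0.7 0.9 0.6];

theta = linspace(0, 2*pi, numel(metrics) + 1); % one spoke per metric, closed loop
polarplot(theta, [trialA trialA(1)], '-o'); hold on
polarplot(theta, [trialB trialB(1)], '-s'); hold off
thetaticks(rad2deg(theta(1:end-1)))
thetaticklabels(metrics)
legend(["Trial A" "Trial B"], 'Location', 'bestoutside')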
The results of our regression testing showed that when the autonomous vehicle was maneuvering through lane changes to avoid collisions with the dynamic actors, our algorithms did not control the lateral accelerations adequately. We found that in a few cases, our controller overshot the maximum allowable lateral acceleration limit by up to 8%. A deeper analysis of this problem revealed that the error was caused by our steering controller. We ended up altering our steering controller to account for our speed, which allowed the acceleration limits to be respected in later testing. While this is currently the only numerically defined result, we also had other findings. We found that our communication structure allowed for read/write errors between different code blocks. Yet another set of results focused on the ability of our lane line detection algorithms to correctly identify and track lane lines local to the front of the vehicle. We found that our original lane tracking software worked nearly 95% of the time with parallel lines in front of the vehicle, but once curves and camera noise were introduced, our original algorithms failed to achieve above a 30% lane tracking ability.
Conclusion
The team was able to draw extremely helpful conclusions from the simulation challenge. The course of learning to create a simulation test bench allowed the team to venture down new paths never considered. While the need to change our controller and our findings regarding the lane line tracking both helped to fix specific problems, we are even happier that we developed a new method for software development.

The ability to run our algorithms in Simulink allowed us to do far more than ever before. We were able to learn how to better set up our data communication methods to include things like ping-pong buffers, bitwise checking of runtime conditions, and internal aging counters to ensure data is fresh in the system. We also learned that using toolboxes like the vehicle communications toolbox allows us to focus our efforts more efficiently toward our problems while allowing established solutions to assist us. Finally, we learned that having the ability to flesh out how the many different code modules interact with each other is extremely valuable. We found that using subsystems within Simulink allowed us to have discussions as low- or high-level as we needed.

Overall, we found that simulation is a more powerful tool than any of us ever considered, and our team has now completely swapped over to a simulation-based development approach: one where we can continually develop, deploy, test, and analyze large amounts of data. We have also developed some future goals regarding the simulation test bench. The team is currently working to create an environment where data collected from real-world testing will be implemented into the simulation to not only validate, but also fuel, our developments toward a fully autonomous vehicle.

Climb stairs and shoot the target: A Student Robotics Project!
https://blogs.mathworks.com/student-lounge/2024/02/01/climb-stairs-and-shoot-the-target-a-student-robotics-project/ (February 1, 2024)


For this week’s blog post, we invited an ABU Robocon team, BRACT’s Vishwakarma Institute of Technology, Pune, to share their journey to winning 3rd place in the MathWorks Modelling Award at DD Robocon 2023. For the 2023 season, the theme and problem statement of the contest was “Casting Flowers over Angkor Wat,” which involved the cooperation of a rabbit robot and an elephant robot. The objective of the game was to toss the team’s colored rings onto 11 poles located in the Angkor Wat area. MathWorks is immensely proud of the team’s achievements, and we hope you also find their insights useful!

Introduction

This blog explores the realms of physical modelling and pole identification as required for the challenge. The mechanism verification for the required robots was accomplished using MATLAB and Simulink. Physical modelling allowed us to understand and anticipate the behavior of complex systems. For pole detection, we deployed YOLOv2, a “You Only Look Once” object detection algorithm, using computer vision technology to reach an acceptable level of accuracy.

Methodology

Modelling of the ring

The ring shape is modelled using a revolved solid block. Providing the block with ring dimensions makes it possible to accurately simulate the contact force between the ring shooting mechanism and the ring. The geometry section has two blocks:
  • In the first one, the user is required to input the cross-section of the ring, envisioning it as a square, and providing the coordinates in an anticlockwise manner.
  • In the second block, users are prompted to input the extent of the revolution, setting it to full. In the inertia section, users are required to set the calculated parameters derived from the original ring’s physical model.
Ring

Modelling of the Robots (Elephant Robot and Rabbit Robot)

MATLAB provides a seamless way to import SolidWorks CAD models into Simulink by converting them to XML files and using the ‘smimport‘ command. However, for complex robots, the import may not generate perfectly aligned parts. To address this, we created a simplified CAD model with only essential components for analysis. Additionally, to visualize Weldments accurately, converting them to STEP files or using simple sketches for import into MATLAB proves effective. This integration streamlines the design process and facilitates a more efficient examination of mechanisms.
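As a minimal sketch of that import step (the file name here is hypothetical), the Simscape Multibody importer is a one-line call:

% Generate a Simscape Multibody model from a CAD assembly exported to XML
% (e.g., via the Simscape Multibody Link plug-in for SolidWorks)
smimport('robot.xml');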

Modelling of the ring shooting mechanism

In the ring shooting mechanism, a hollow aluminium square section is mounted on a motor shaft at a specific distance from the ring. When we actuate the motor, the link shoots the ring, and it is thrown into the pole.
Ring Shooting Mechanism

Challenge encountered

One of the main challenges we faced was determining the appropriate length and position of the link, as well as the position of the ring on the guideway (a plate where the ring is placed for shooting). Ensuring consistent placement of the ring actively contributed to the effectiveness of guideway manufacturing. To find the optimal configuration, we conducted multiple simulations in MATLAB, varying the ring positions and link lengths, and manufactured the best combination. The figure illustrates the ideal link length, position, ring placement, and shooting height. This meticulous approach resulted in improved results and overall performance.

Calculate torque for the motor

% Given:
%   Distance along x-axis:       s_x = 4 m
%   Distance along y-axis:       s_y = 1.2 m
%   Moment of inertia of link:   I = 0.0012 kg-m^2
%   Angle of shooting:           theta = 45 deg
%   u = velocity of ring, t = time of flight, w = angular velocity of link
%
% Projectile motion:
%   s_x = u*cos(45 deg)*t                    ... (1)
%   s_y = u*sin(45 deg)*t - (1/2)*g*t^2      ... (2)
% Solving (1) and (2): u = 7.64 m/s and t = 0.74 s
%
% Law of conservation of energy (link -> ring):
%   Rotational energy of link = kinetic energy of ring
%   (1/2)*I*w^2 = (1/2)*M*u^2
%   => w = 73.14 rad/s, i.e. RPM = 698
%
% Torque: tau = I*alpha, with alpha = w/t (assume t = 0.05 s)
%   alpha = 73.14/0.05 = 1462.8 rad/s^2
%   tau   = 0.0012*1462.8 = 1.755 N-m
%
% Final Torque = tau * FOS * 10 (Factor of Safety = 1.5)
%              = 27.5 kg-cm
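A runnable sketch of the same calculation, using the team’s reported angular velocity and their assumed 0.05 s spin-up time, might look like this (the 100/9.81 factor converts N-m to kg-cm, which the hand calculation rounds to a factor of 10):

I = 0.0012;         % moment of inertia of the link (kg-m^2)
w = 73.14;          % angular velocity of the link (rad/s)
alpha = w / 0.05;   % angular acceleration over the assumed 0.05 s spin-up (rad/s^2)
tau = I * alpha;    % required torque, ~1.76 N-m
FOS = 1.5;          % factor of safety
finalTorque = tau * FOS * 100 / 9.81  % ~27 kg-cm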

Modelling of Bridge Climbing Mechanism

The main challenge in modelling the robot’s slope climbing was stability: the centre of mass of the robot sat at a significant height above the ground. To address this concern, modelling the slope climb proved effective for assessing whether the robot climbs the slope or topples along the way.
Robot climbing the bridge

Modelling of Step Climbing Mechanism

Simulink models of step climbing are used to test the stability of the mechanism and to tackle the challenges encountered.
Robot climbing the step

Pole Detection using Computer Vision and Deep Learning

Computer Vision and Deep Learning techniques were used for the detection of the pole, which helped the Rabbit Robot align properly with the pole and increase the accuracy and efficiency of the operation of the Rabbit Robot. The YOLOv2 algorithm was used for the detection of poles and deployed on a Jetson Nano.

Creating the Training Dataset

A custom and diverse dataset was made by capturing 800 images containing the actual pole from various angles, scales, backgrounds, and lighting conditions.

Annotation of Images and Splitting of Dataset

The Image Labeler app was used for the annotation, or labelling, of images; ground truth data for the dataset was created by labelling all the images manually. The goal was to detect only one object, the pole. Before labelling, the images were bigger than the network input size, so all the images were resized to [244 244 3], the minimum size to run the YOLOv2 object detection network.

Creating YOLOv2 Object Detection Network

This network consists of a feature extraction network and a detection network. ResNet-50, a pretrained 50-layer convolutional neural network, was used for feature extraction. The object detection network was made by specifying all the parameters, like input size, number of classes, and anchor boxes.

Data Augmentation

Data augmentation randomly transforms the training dataset to create new, diverse variations of the existing data. A few techniques used for data augmentation include brightness and contrast adjustment, and blurring.

Train YOLOv2 Object Detector and Evaluate

We specified all the parameters, like batch size, learning rate, and epochs, to train the object detector model. The average precision metric of the Computer Vision Toolbox was used to evaluate performance. The precision–recall curve was plotted to test how precisely our detector model had been trained. An average precision of 0.91 was achieved.
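A condensed sketch of this pipeline is shown below. It assumes the exported ground truth has already been converted to a training table poleDataset (columns imageFilename and pole) and that testData is a held-out labelled datastore; those names, the anchor boxes, and the training options are placeholders rather than the team’s actual settings.

inputSize = [244 244 3];          % network input size used by the team
numClasses = 1;                   % a single object class: pole
anchorBoxes = [50 50; 100 100];   % illustrative anchors; estimate them from your data

featureExtractor = resnet50;               % pretrained 50-layer CNN for feature extraction
featureLayer = 'activation_40_relu';       % feature layer commonly used with ResNet-50
lgraph = yolov2Layers(inputSize, numClasses, anchorBoxes, featureExtractor, featureLayer);

options = trainingOptions('sgdm', 'MiniBatchSize', 8, 'InitialLearnRate', 1e-3, 'MaxEpochs', 20);
detector = trainYOLOv2ObjectDetector(poleDataset, lgraph, options);

% Evaluate on held-out data and plot the precision-recall curve
detectionResults = detect(detector, testData);
[ap, recall, precision] = evaluateDetectionPrecision(detectionResults, testData);
plot(recall, precision); xlabel('Recall'); ylabel('Precision');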

Results and Conclusion

Team BRACT’s VIT Pune modelled robots and systems using MATLAB and Simulink. The team’s efforts were primarily driven by a desire to solve practical problems with effective engineering solutions. The lack of a direct ring shape in Simscape prompted them to creatively use the revolved solid block to replicate the ring shooting mechanism. They also addressed the stability issues that can arise when climbing a slope, and meticulously refined the robot’s design and shooting mechanism through thorough simulations. The team deployed its object detection model on the NVIDIA Jetson Nano, further embracing cutting-edge technology like computer vision and deep learning to precisely detect and align with the poles present in the Robocon ’23 arena.

Future Scope

MATLAB’s numerical computing and simulation features allow users to explore more intricate physical models, improving the performance and stability of their robots. Users can explore new configurations and improve old models by using Simulink to construct and evaluate complex mechanisms, and they can navigate challenging areas and take on tasks beyond pole detection by incorporating cutting-edge sensors and perception algorithms into the robots. Participating in robotics contests can help team members develop their innovation, collaboration, and leadership skills. All things considered, it provides the ability to advance robotics and significantly advance automation and intelligent systems.

Navigating the depth: Advocating for better Engineering Education with MATLAB on YouTube
https://blogs.mathworks.com/student-lounge/2024/01/15/navigating-the-depth-advocating-for-better-engineering-education-with-matlab-on-youtube/ (January 15, 2024)


Today, we are talking to Phil Parisi, who recently graduated from the University of Rhode Island and now works as a robotics researcher at a coastal research laboratory. He also runs a popular YouTube channel where he often discusses MATLAB coding.

When did you first get exposed to MATLAB and Simulink?

I first learned MATLAB during the second semester of my freshman year of college in a general engineering course. This came right after learning 3D CAD modeling and, admittedly, it was a jarring switch. MATLAB was introduced as ‘something we needed to know.’ However, my engineering colleagues and I were focused on design and manufacturing and were confused as to why we needed to start programming. We used the language for the semester and then didn’t touch MATLAB again in our engineering curriculum until our junior year in Numerical Methods.
Hans Scharler and Toshi Takeuchi welcome Phil Parisi for a visit at the MathWorks Campus in Natick, MA.

So, you were not hooked right away?

No. I was not a fan of MATLAB to start. I had no motivation because I didn’t understand why I needed to learn MATLAB in my freshman year.

What changed your mind?

The first time I viewed MATLAB as something that could help me with coursework was in my Linear Algebra class. We were solving matrix functions (e.g. multiplication, determinants, Gaussian elimination) by hand and I began using MATLAB to check my answers. It certainly helped me earn my ‘A’ in that class!
I appreciated the language only after realizing the benefits. In Numerical Methods, we began working with problems that were no longer possible to solve analytically. Seeing how MATLAB could provide approximate, numerical solutions to complex problems was empowering and exciting. It was then that I actually learned MATLAB and actively sought to improve my coding skills.

What else did you use MATLAB for in school?

My MATLAB usage grew exponentially in ocean engineering graduate school. Classes such as Acoustics, Oceanographic Data Analysis, Random Processes, and Probabilistic Robotics all depended on MATLAB to churn through data and generate outputs. Additionally, I used MATLAB for my Masters Thesis research project, seafloor mapping with machine learning, to test out code concepts before writing compilable C++ programs.
MATLAB’s competitive advantage over other languages became clear when my friend and I had to develop a robotic particle filter algorithm to track a free-falling ocean vehicle as it drifted 8,000m to the seafloor. We ran a for loop in MATLAB to continually update the state estimate of the vehicle, and ChatGPT offered ways to speed up the algorithm (parfor loops, reducing data transfer between functions, and pre-allocating space for large matrices). Because of MATLAB’s ease of use, we completed this project in four weeks, wrote a paper, and presented our work at a conference a few months later!
Algorithmically tracking the Deep Autonomous Profiler as it falls 8,000m to the seafloor. A measurement update (gray hemisphere) culls 100,000 particles’ initial positions (red) to more confident estimates (black) using MATLAB.
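As a generic illustration of the speedups Phil mentions (not the team’s actual code), pre-allocating the particle array and letting parfor from Parallel Computing Toolbox propagate particles independently looks like this; the drift and noise values are invented.

numParticles = 100000;              % particle count from the project description
numSteps = 500;                     % illustrative number of filter updates
particles = zeros(numParticles, 3); % pre-allocate x, y, z for every particle

parfor p = 1:numParticles
    state = particles(p, :);
    for k = 1:numSteps
        state = state + [0 0 -16] + randn(1, 3); % illustrative drift plus noise per step
    end
    particles(p, :) = state; % sliced assignment keeps data transfer between workers small
end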

You started a YouTube channel. Why did you decide to do so?

I wanted to offer students and young professionals an opportunity to learn the MATLAB language through a better lens than when I learned it. I believe engineering brains are wired differently than computer scientists’ brains, and thus the educational approaches to teach engineers how to code must be adjusted. As engineers, we need practical examples of how MATLAB can be used to fit a curve to data, perform numerical integration, or calculate a gradient, while also understanding foundational concepts like data types and data structures. I continually try to coalesce these concepts together into my YouTube videos and wish to provide a comfortable platform for new learners.

You have strong opinions about how MATLAB is taught in engineering schools.

I think there needs to be structural changes in the way universities embrace programming throughout the curriculum. We need to see consistent usage across courses, during every academic semester. In Material Selection, for example, homework problems should ask for the stress seen by each common material at a variety of thicknesses (rather than the stress experienced by a 48” cantilever beam made of steel). We can move from a single computation done by hand (the latter) to a few matrix calculations that help us understand material behaviors broadly (the former). If this approach can be taken for each class, then engineers will be equipped with a mindset that inherently incorporates programming, rather than using MATLAB purely as an afterthought.
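A toy version of that homework reframing, with invented numbers, takes only a few vectorized lines in MATLAB: the bending stress of a rectangular cantilever evaluated for a whole vector of thicknesses at once rather than for one steel beam.

F = 500;                 % tip load (N)
L = 1.22;                % beam length, roughly 48 in (m)
b = 0.05;                % cross-section width (m)
t = (0.005:0.005:0.05)'; % ten candidate thicknesses (m)
I = b .* t.^3 / 12;      % second moment of area for each thickness
sigma = (F * L) .* (t / 2) ./ I; % max bending stress, sigma = M*c/I, for every thickness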

What do you do for work now? Were MATLAB skills useful to get to the position you now have?

Currently, I work as a robotics researcher at a coastal research laboratory. MATLAB was certainly a requirement, in addition to Python, as they are ‘common languages’ spoken across researchers from different fields. However, more important than knowing the language itself was demonstrating my ability to apply a programming language to solve a complex problem. That’s what it really means to ‘know a language’ and that’s what employers want to see.

Do you use MATLAB at your job? If so, can you share something interesting about your use of MATLAB at your job?

Yes, I use MATLAB amidst a variety of programming languages. Daily, I have to ask myself what tool to use to solve a given task. Some days are spent in Python running scripts on a RaspberryPi, others are spent looking at sensor data .mat files in MATLAB, and others are spent learning new frameworks like Docker to spin up Linux environments so projects can run smoothly. I’ve started seeing the macro-perspective of embracing each language for what it’s best at. And when it comes to getting a concept working, quick n’ dirty, I always turn to MATLAB first!

Predicting Timely Diagnosis of Metastatic Breast Cancer for the WiDS Datathon 2024
https://blogs.mathworks.com/student-lounge/2024/01/01/predicting-timely-diagnosis-of-metastatic-breast-cancer-for-the-wids-datathon-2024/ (January 1, 2024)


In today’s blog, Grace Woolson will show how you can use MATLAB and machine learning to make meaningful deductions from healthcare data for patients who have been diagnosed with metastatic breast cancer. Over to you Grace!

Introduction

In this blog, I will show how you can use MATLAB for the WiDS Datathon 2024 using the dataset for the WiDS Datathon #1, which runs from January 9th 2024 – March 1st 2024. This challenge tasks participants with creating a model that can predict whether or not a patient with metastatic breast cancer will receive a diagnosis within 90 days based on patient and environmental data. This can help identify relationships between demographics or environmental hazards with the likelihood of getting timely treatment. Please note that this blog is based on a subset of the data and there may be slight differences between this dataset and the one provided by WiDS.
MathWorks is happy to support participants of the Women in Data Science Datathon 2024 by providing complimentary MATLAB licenses, tutorials, workshops, and additional resources. To request complimentary licenses for you and your teammates, go to this MathWorks site, click the “Request Software” button, and fill out the software request form.
This tutorial will walk through the following steps of the model-making process:
  1. Importing a Tabular Dataset
  2. Preprocessing the Data
  3. Exploring and Analyzing Tabular Data
  4. Choosing and Creating Features
  5. Training a Machine Learning Model
  6. Evaluating a Machine Learning Model
  7. Making New Predictions and Exporting Submissions

Import Data

First, make sure the ‘Current Folder’ is the folder where you saved the data. If you have not already done so, you can download the data from Kaggle after you register for the datathon. The data is provided as a .CSV file, so we can use the readtable function to import the whole file as a table.
dataFolder = fullfile(pwd);
trainDataFilename = 'Training.csv';
allTrainData = readtable(fullfile(dataFolder, trainDataFilename))
allTrainData = 12906×83 table
(Preview truncated: each row holds patient_id plus demographic fields such as patient_race, payer_type, patient_state, patient_zip3, patient_age, patient_gender, and bmi; diagnosis fields such as breast_cancer_diagnosis_code, breast_cancer_diagnosis_desc, and metastatic_cancer_diagnosis_code; and zip-level statistics such as Region, Division, population, density, and age, gender, and marital-status breakdowns.)
I want to see some high-level statistics about the data, so I’ll use the summary function to get an idea of what kind of information we have.
summary(allTrainData)

Variables (each 12906×1; Min, Median, Max, and NumMissing shown for numeric variables):

patient_id                      double   Min 1.0006e+05   Median 5.4352e+05   Max 9.999e+05
patient_race                    cell array of character vectors
payer_type                      cell array of character vectors
patient_state                   cell array of character vectors
patient_zip3                    double   Min 101          Median 554          Max 999
patient_age                     double   Min 18           Median 59           Max 91
patient_gender                  cell array of character vectors
bmi                             double   Min 14           Median 28.19        Max 85           NumMissing 8965
breast_cancer_diagnosis_code    cell array of character vectors
breast_cancer_diagnosis_desc    cell array of character vectors
metastatic_cancer_diagnosis_code        cell array of character vectors
metastatic_first_novel_treatment        cell array of character vectors
metastatic_first_novel_treatment_type   cell array of character vectors
Region                          cell array of character vectors
Division                        cell array of character vectors
population                      double   Min 635.55       Median 19154        Max 71374        NumMissing 1
density                         double   Min 0.91667      Median 700.34       Max 21172        NumMissing 1
age_median                      double   Min 20.6         Median 40.639       Max 54.57        NumMissing 1
age_under_10                    double   Min 0            Median 11.039       Max 17.675       NumMissing 1
age_10_to_19                    double   Min 6.3143       Median 12.924       Max 35.3         NumMissing 1
age_20s                         double   Min 5.925        Median 12.538       Max 62.1         NumMissing 1
age_30s                         double   Min 1.5          Median 12.443       Max 25.471       NumMissing 1
age_40s                         double   Min 0.8          Median 12.124       Max 17.82        NumMissing 1
age_50s                         double   Min 0            Median 13.568       Max 21.661       NumMissing 1
age_60s                         double   Min 0.2          Median 12.533       Max 29.855       NumMissing 1
age_70s                         double   Min 0            Median 7.3169       Max 19           NumMissing 1
age_over_80                     double   Min 0            Median 3.8          Max 18.825       NumMissing 1
male                            double   Min 39.725       Median 49.976       Max 61.6         NumMissing 1
female                          double   Min 38.4         Median 50.024       Max 60.275       NumMissing 1
married                         double   Min 0.9          Median 49.434       Max 66.903       NumMissing 1
divorced                        double   Min 0.2          Median 12.653       Max 19.831       NumMissing 1
never_married                   double   Min 13.44        Median 32.004       Max 98.9         NumMissing 1
widowed                         double   Min 0            Median 5.5208       Max 23.055       NumMissing 1
family_size                     double   Min 2.5504       Median 3.1665       Max 4.1723       NumMissing 4
family_dual_income              double   Min 19.312       Median 52.592       Max 70.925       NumMissing 4
income_household_median         double   Min 29222        Median 69803        Max 1.6412e+05   NumMissing 4
income_household_under_5        double   Min 0.75         Median 2.8382       Max 19.62        NumMissing 4
income_household_5_to_10        double   Min 0.36154      Median 2.1604       Max 11.872       NumMissing 4
income_household_10_to_15       double   Min 1.0154       Median 3.7171       Max 14.278       NumMissing 4
income_household_15_to_20       double   Min 1.0278       Median 3.7712       Max 12.918       NumMissing 4
income_household_20_to_25       double   Min 1.1          Median 4.0421       Max 14.35        NumMissing 4
income_household_25_to_35       double   Min 2.65         Median 8.4353       Max 18.34        NumMissing 4
income_household_35_to_50       double   Min 1.7          Median 11.793       Max 24.075       NumMissing 4
income_household_50_to_75       double   Min 4.95         Median 17.076       Max 27.13        NumMissing 4
income_household_75_to_100      double   Min 4.7333       Median 12.677       Max 24.8         NumMissing 4
income_household_100_to_150     double   Min 4.2889       Median 16.016       Max 31.325       NumMissing 4
income_household_150_over       double   Min 0.84         Median 14.703       Max 52.824       NumMissing 4
income_household_six_figure     double   Min 5.6926       Median 30.575       Max 69.032       NumMissing 4
income_individual_median        double   Min 4316         Median 35253        Max 88910        NumMissing 1
home_ownership                  double   Min 15.85        Median 69.669       Max 90.367       NumMissing 4
housing_units                   double   Min 0            Median 6994.4       Max 25923        NumMissing 1
home_value                      double   Min 60629        Median 2.4784e+05   Max 1.8531e+06   NumMissing 4
rent_median                     double   Min 448.4        Median 1168         Max 2965.2       NumMissing 4
rent_burden                     double   Min 17.416       Median 30.986       Max 78.94        NumMissing 4
education_less_highschool       double   Min 0            Median 10.843       Max 34.325       NumMissing 1
education_highschool            double   Min 0            Median 27.406       Max 53.96        NumMissing 1
education_some_college          double   Min 7.2          Median 29.286       Max 50.133       NumMissing 1
education_bachelors             double   Min 2.4657       Median 19.047       Max 41.7         NumMissing 1
education_graduate              double   Min 2.0941       Median 10.796       Max 51.84        NumMissing 1
education_college_or_above      double   Min 7.0488       Median 30.141       Max 77.817       NumMissing 1
education_stem_degree           double   Min 23.915       Median 43.066       Max 73           NumMissing 1
labor_force_participation       double   Min 30.7         Median 62.778       Max 78.67        NumMissing 1
unemployment_rate               double   Min 0.82308      Median 5.4741       Max 18.8         NumMissing 1
self_employed                   double   Min 2.263        Median 12.748       Max 25.538       NumMissing 4
farmer                          double   Min 0            Median 0.45493      Max 26.729       NumMissing 4
race_white                      double   Min 14.496       Median 70.878       Max 98.444       NumMissing 1
race_black                      double   Min 0.060976     Median 6.4103       Max 69.66        NumMissing 1
race_asian                      double   Min 0            Median 2.9667       Max 49.85        NumMissing 1
race_native                     double   Min 0            Median 0.43095      Max 76.935       NumMissing 1
race_pacific                    double   Min 0            Median 0.054054     Max 14.758       NumMissing 1
race_other                      double   Min 0.0025641    Median 3.5136       Max 33.189       NumMissing 1
race_multiple                   double   Min 0.43333      Median 5.802        Max 26.43        NumMissing 1
hispanic                        double   Min 0.19444      Median 11.983       Max 91.005       NumMissing 1
disabled                        double   Min 4.6          Median 12.884       Max 35.156       NumMissing 1
poverty                         double   Min 3.4333       Median 12.178       Max 38.348       NumMissing 4
limited_english                 double   Min 0            Median 2.7472       Max 26.755       NumMissing 4
commute_time                    double   Min 12.461       Median 27.788       Max 48.02        NumMissing 1
health_uninsured                double   Min 2.44         Median 7.4657       Max 27.566       NumMissing 1
veteran                         double   Min 1.2          Median 6.8471       Max 25.2         NumMissing 1
Ozone                           double   Min 30.939       Median 39.108       Max 52.237       NumMissing 29
PM25                            double   Min 2.636        Median 7.6866       Max 11.169       NumMissing 29
N02                             double   Min 2.7604       Median 15.589       Max 31.505       NumMissing 29
DiagPeriodL90D                  double   Min 0            Median 1            Max 1

Take some time to scroll through this summary and see what information or patterns you can learn! Here are some things I notice:
  1. There are a lot of rows or variables that just say “cell array of character vectors”, which doesn’t tell us much about the data.
  2. There are a few variables that have a high ‘NumMissing’ value.
  3. The numeric variables can have dramatically different minimums and maximums.
We can use these observations to make decisions about how we want to explore and preprocess the dataset.

Process and Clean the Data

1. Convert text data to categorical

Text data can be hard for machine learning algorithms to understand, so let’s go through and change each “cell array of character vectors” to a categorical. This will help the algorithm sort the text into different categories instead of understanding it as a series of individual letters.
varTypes = varfun(@class, allTrainData, OutputFormat="cell");
catIdx = strcmp(varTypes, "cell");
varNames = allTrainData.Properties.VariableNames;
catVarNames = varNames(catIdx);
for catNameIdx = 1:length(catVarNames)
    allTrainData.(catVarNames{catNameIdx}) = categorical(allTrainData.(catVarNames{catNameIdx}));
end

2. Handle Missing Data

Now I want to handle all that missing data I noticed earlier. I’ll go through each variable and specifically look at variables that are missing data for over half of the rows or observations.
dataSum = summary(allTrainData);
for nameIdx = 1:length(varNames)
    varName = varNames{nameIdx};
    varNumMissing = dataSum.(varName).NumMissing;
    if varNumMissing > (height(allTrainData) / 2)
        disp(varName);
        disp(varNumMissing);
    end
end
bmi
8965
metastatic_first_novel_treatment
12882
metastatic_first_novel_treatment_type
12882
Let’s remove those variables entirely, since they might not be too helpful for our algorithm.
allTrainData = removevars(allTrainData, ["bmi", "metastatic_first_novel_treatment", "metastatic_first_novel_treatment_type"])
allTrainData = 12906×80 table
(Preview truncated: the bmi column and the two treatment columns are gone, and the former text columns now display as categorical values, with missing entries shown as <undefined>.)
Now I want to look at each row and remove any that are missing too many values. It’s okay to have a couple of missing data points in your dataset, but if you have too many it could cause your machine learning algorithm to be less accurate. I’ll use the Clean Missing Data live task to remove any rows that are missing 2 or more data points.
% Remove missing data
[fullData,missingIndices] = rmmissing(allTrainData,"MinNumMissing",2);
% Display results
figure
% Get locations of missing data
indicesForPlot = ismissing(allTrainData.patient_age);
mask = missingIndices & ~indicesForPlot;
% Plot cleaned data
plot(find(~missingIndices),fullData.patient_age,"SeriesIndex",1,"LineWidth",1.5, ...
    "DisplayName","Cleaned data")
hold on
% Plot data in rows where other variables contain missing entries
plot(find(mask),allTrainData.patient_age(mask),"x","SeriesIndex","none", ...
    "DisplayName","Removed by other variables")
% Plot removed missing entries
x = repelem(find(indicesForPlot),3);
y = repmat([ylim(gca) missing]',nnz(indicesForPlot),1);
plot(x,y,"Color",[145 145 145]/255,"DisplayName","Removed missing entries")
title("Number of removed missing entries: " + nnz(indicesForPlot))
hold off
legend
ylabel("patient_age","Interpreter","none")
clear indicesForPlot mask x y

Explore the Data

Now that the data is cleaned up, you should spend some time exploring it. Look at how different variables may interact with each other, see if you can draw any meaningful conclusions from the data, and try to figure out which variables may be more or less important when it comes to predicting time to diagnosis.

Univariate Analysis

First, I want to separate the data into two datasets: one full of patients who were diagnosed in 90 days or less (the 1 or “True” values), and one full of patients who were not (the 0 or “False” values). This will allow me to explore the data patterns in each of these datasets and look for any meaningful differences.
allTrueIdx = fullData.DiagPeriodL90D == 1;
allTrueData = fullData(allTrueIdx, :);
allTrueData = 7559×80 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married divorced never_married widowed
1 475714 <undefined> MEDICAID CA 924 84 F C50919 Malignant neoplasm of unsp site of unspecified female breast C7989 West Pacific 3.1438e+04 1.1896e+03 30.6429 16.0143 15.5429 17.6143 14.0143 11.6143 11.5571 7.5714 4 2.1000 49.8571 50.1429 36.5714 11.8857 47.1143 4.4429
2 349367 White COMMERCIAL CA 928 62 F C50411 Malig neoplm of upper-outer quadrant of right female breast C773 West Pacific 3.9122e+04 2.2959e+03 38.2000 11.8788 13.3545 14.2303 13.4182 13.3333 14.0606 10.2485 5.9515 3.5030 49.8939 50.1061 50.2455 9.8273 35.2909 4.6515
3 138632 White COMMERCIAL TX 760 43 F C50112 Malignant neoplasm of central portion of left female breast C773 South West South Central 2.1997e+04 626.2367 37.9067 13.0283 14.4633 12.5317 13.5450 12.8600 12.7700 11.4267 6.5650 2.8117 50.1233 49.8767 55.7533 12.3300 27.1950 4.7100
4 914071 <undefined> COMMERCIAL CA 900 51 F C50912 Malignant neoplasm of unspecified site of left female breast C779 West Pacific 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785 11.3015 50.4569 4.7662
5 479368 White COMMERCIAL IL 619 60 F C50512 Malig neoplasm of lower-outer quadrant of left female breast C773 Midwest East North Central 3.4041e+03 25.7333 42.7900 11.9833 13.2567 9.5733 12.4000 11.8133 13.5767 14.0433 8.5267 4.8533 49.2833 50.7167 55.8867 12.6400 24.5267 6.9433
6 155485 <undefined> COMMERCIAL IL 617 64 F C50912 Malignant neoplasm of unspecified site of left female breast C773 Midwest East North Central 4.4353e+03 68.0019 41.3000 12.8358 13.6811 10.5245 11.9377 11.6585 13.5774 13.7434 7.6868 4.3415 49.3962 50.6038 57.8962 10.8981 24.9547 6.2472
7 266700 White COMMERCIAL MI 480 58 F C50812 Malignant neoplasm of ovrlp sites of left female breast C781 Midwest East North Central 1.6938e+04 894.1681 42.9348 10.5116 11.8130 11.9217 12.4043 12.4304 15.2710 13.8826 7.8522 3.8971 50.0217 49.9783 50.9565 12.3145 30.8333 5.9014
8 880521 Other COMMERCIAL CA 945 58 F C50911 Malignant neoplasm of unsp site of right female breast C773 West Pacific 3.0154e+04 976.2892 42.1358 10.7531 12.7148 11.7259 13.1012 12.8173 13.3012 12.7716 8.4136 4.4086 49.7272 50.2728 53.0765 10.9123 30.5346 5.4667
9 971531 Hispanic MEDICARE ADVANTAGE IL 606 83 F C50911 Malignant neoplasm of unsp site of right female breast C773 Midwest East North Central 4.8671e+04 6.4314e+03 35.7554 10.4286 10.6518 18.3107 18.9036 11.9696 11.7268 9.6839 5.4071 2.8911 48.6964 51.3036 35.9304 10.2982 49.0054 4.7643
10 529840 White COMMERCIAL MT 590 60 F C50411 Malig neoplm of upper-outer quadrant of right female breast C773 West Mountain 1.2208e+03 2.1597 46.7408 11.1521 11.1000 7.9183 10.3338 10.7577 15.4211 18.9042 9.4479 5
11 198037 White MEDICAID KY 402 45 F C50312 Malig neoplasm of lower-inner quadrant of left female breast C773 South East South Central 2.2669e+04 1.1427e+03 37.4937 10.9688 13.6031 15.2281 14.9219 11.7219 12.1375 11.5188 6.3156 3.5625 48.8344 51.1656 39.7906 15.0312 39.2875 5.8906
12 791301 <undefined> MEDICARE ADVANTAGE CA 958 58 F C50112 Malignant neoplasm of central portion of left female breast C773 West Pacific 3.0687e+04 1.9179e+03 36.5517 11.6207 11.4655 16.1345 15.9655 12.5276 12.4793 11.0655 5.6034 3.1586 49.5138 50.4862 41.5345 13.7034 40.1793 4.5793
13 618259 White COMMERCIAL OH 430 55 F C50311 Malig neoplm of lower-inner quadrant of right female breast C773 Midwest East North Central 1.4386e+04 263.5774 40.6393 11.8852 14.2492 10.8426 11.5590 12.6984 13.8869 12.8131 8.2557 3.8148 49.7082 50.2918 53.7148 13.5279 25.8443 6.9180
14 393934 White <undefined> CO 801 70 F C50911 Malignant neoplasm of unsp site of right female breast C7951 West Mountain 2.1243e+04 564.7743 42.4114 10.4086 14.4486 10.6314 11.8086 13.7086 15.9543 13.2914 6.9514 2.8057 50.7971 49.2029 60.1086 10.8800 25.5914 3.4286
allFalseIdx = fullData.DiagPeriodL90D == 0;
allFalseData = fullData(allFalseIdx, :);
allFalseData = 4598×80 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married divorced never_married widowed
1 617843 White COMMERCIAL CA 926 45 F C50212 Malig neoplasm of upper-inner quadrant of left female breast C773 West Pacific 3.2795e+04 1.8962e+03 42.8714 10.0714 12.1357 12.5381 12.4643 12.6500 14.8476 12.2810 8.2167 4.7595 49.0667 50.9333 52.6048 11.6238 31.1429 4.6238
2 817482 <undefined> COMMERCIAL ID 836 55 F 1749 Malignant neoplasm of breast (female), unspecified C773 West Mountain 1.0886e+04 116.8860 43.4735 10.8240 13.9760 9.4920 10.3640 12.6000 14.9920 14.8360 9.4620 3.4660 52.3120 47.6880 57.8820 14.9640 21.7600 5.4060
3 111545 White MEDICARE ADVANTAGE NY 141 66 F 1749 Malignant neoplasm of breast (female), unspecified C7981 Northeast Middle Atlantic 5.6438e+03 219.3629 45.1800 8.5114 14.8571 11.0886 9.7543 13.6143 13.3743 15.6857 9.4457 3.6457 50.9114 49.0914 51.3229 11.7600 30.8314 6.0914
4 875977 <undefined> MEDICARE ADVANTAGE MI 488 67 F C50412 Malig neoplasm of upper-outer quadrant of left female breast C799 Midwest East North Central 8101 246.2810 40.2782 11.0456 14.7684 13.3848 11.4671 11.2203 14.8975 12.5899 7.1494 3.4709 51.3228 48.6772 49.0658 13.6051 31.8848 5.4392
5 343914 <undefined> MEDICARE ADVANTAGE CA 900 66 F 1749 Malignant neoplasm of breast (female), unspecified C7800 West Pacific 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785 11.3015 50.4569 4.7662
6 615208 Other COMMERCIAL OR 975 62 F C50411 Malig neoplm of upper-outer quadrant of right female breast C786 West Pacific 1.2836e+04 87.3667 48.9208 9.3458 9.4500 8.7833 11.9542 10.3458 12.6000 17.8833 13.8708 5.7542 50.4708 49.5292 53.7167 15.8583 23.1333 7.2708
7 279917 White MEDICARE ADVANTAGE NY 142 75 F C50912 Malignant neoplasm of unspecified site of left female breast C7801 Northeast Middle Atlantic 2.0195e+04 2.1920e+03 36.4690 10.2207 15.4345 17.8241 13.2483 10.2897 11.7345 11.5276 5.8966 3.8345 48.4414 51.5586 31.7517 12.9966 49.4724 5.7759
8 366792 Asian COMMERCIAL MI 482 46 F C50412 Malig neoplasm of upper-outer quadrant of left female breast C773 Midwest East North Central 2.2081e+04 1.6665e+03 36.5861 12.7778 12.8556 15.8083 13.2028 11.8889 12.4556 11.6000 5.9194 3.4833 48.5472 51.4528 26.2417 14.7028 52.5722 6.4611
9 643360 <undefined> COMMERCIAL NY 120 52 F 1744 Malignant neoplasm of upper-outer quadrant of female breast C7800 Northeast Middle Atlantic 5.1122e+03 103.9061 46.2954 9.0636 10.6182 11.3000 10.9576 11.0045 16.7348 15.3530 10.3818 4.5909 52.1348 47.8652 49.7773 13.7470 29.3818 7.0924
10 487817 <undefined> COMMERCIAL TX 773 57 F 1749 Malignant neoplasm of breast (female), unspecified C773 South West South Central 2.4751e+04 352.2268 41.3712 11.9302 12.9868 10.9962 11.1623 13.1075 13.0226 13.0660 9.5774 4.1585 49.3547 50.6453 52.9943 13.3415 25.0943 8.5792
11 345047 <undefined> COMMERCIAL TX 751 78 F C50912 Malignant neoplasm of unspecified site of left female breast C773 South West South Central 1.6981e+04 271.9135 38.5392 13.2529 15.1843 11.8118 11.7980 12.7176 14.0510 11.8373 6.3667 2.9627 49.8667 50.1333 52.8333 13.5725 28.0196 5.5725
12 907418 White MEDICARE ADVANTAGE IN 460 50 F 1749 Malignant neoplasm of breast (female), unspecified C7951 Midwest East North Central 1.3549e+04 256.8795 40.2864 13.3023 13.2045 10.9000 14.2909 12.6364 13.2909 11.8773 6.3886 4.0886 50.8545 49.1455 54.1114 12.4795 27.4068 5.9955
13 908851 White MEDICARE ADVANTAGE FL 339 82 F 1749 Malignant neoplasm of breast (female), unspecified C7800 South South Atlantic 1.8007e+04 479.7347 50.9592 7.9449 9.9816 10.4449 9.7082 9.3714 13.1020 16.6653 16.0143 6.7714 49.0306 50.9694 52.1673 13.8000 25.2510 8.7939
14 785337 Black COMMERCIAL VA 234 54 F 1741 Malignant neoplasm of central portion of female breast C7951 South South Atlantic 1.3242e+04 299.2533 44.9310 9.1227 10.6818 13.8364 11.3295 9.5705 14.7477 14.3523 12.4568 3.8909 50.5205 49.4795 50.1318 13.8023 29.3364 6.7386
Now we can use the Create Plot live task to plot histograms of the different variables in each dataset. In the plot below, blue bars represent data from the folks who were diagnosed in a timely manner, and the red bars represent data from the folks who were not.
figure
% Histogram for patients diagnosed within 90 days (blue)
histogram(allTrueData.health_uninsured,"NumBins",40,"DisplayName","Diagnosed within 90 days");
hold on
% Histogram for patients not diagnosed within 90 days (red)
histogram(allFalseData.health_uninsured,"NumBins",40,"DisplayName","Not diagnosed within 90 days");
hold off
legend
Take some time to explore these visualizations on your own, as I can only show one at a time in this blog. It is worth noting that we have fewer False observations than True observations, so the red bars will almost always be lower than the blue bars. If there are red bars that are higher, or if the shapes of the distributions differ, that may indicate a relationship between a variable and time to diagnosis.
I didn’t see many significant differences in shape, though I did notice that for the ‘health_uninsured’ histograms the red bars are fairly high at the higher values, indicating that there may be a correlation between populations with high rates of being uninsured and time to diagnosis.
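Because the two groups are different sizes, you can also normalize the histograms so the bar heights show each group's proportions instead of raw counts, which makes the shapes directly comparable. Here is a small tweak of the generated code above (an illustrative addition, not part of the original live task output):
figure
% Normalized histograms: each group's bars sum to 1, so shapes can be compared fairly
histogram(allTrueData.health_uninsured, "NumBins", 40, "Normalization", "probability", ...
    "DisplayName", "Diagnosed within 90 days");
hold on
histogram(allFalseData.health_uninsured, "NumBins", 40, "Normalization", "probability", ...
    "DisplayName", "Not diagnosed within 90 days");
hold off
legend
ylabel("Proportion of group")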

Bivariate and Multivariate Analysis

You can break the data down further and plot two (or more!) variables against each other to see if you can find any patterns. In the plot below, for example, we can see the percentage of the population that is uninsured and the state the patient is in, broken down by whether or not the patient was diagnosed within 90 days. Again, blue values indicate that the patient was, and red values indicate that the patient was not.
figure
% Scatter plot for patients diagnosed within 90 days (blue)
scatter(allTrueData,"patient_state","health_uninsured","DisplayName","Diagnosed within 90 days");
hold on
% Scatter plot for patients not diagnosed within 90 days (red)
scatter(allFalseData,"patient_state","health_uninsured","DisplayName","Not diagnosed within 90 days");
hold off
legend
We can see that in some states, such as GA, OK, or TX, the red values come from populations that typically have higher rates of being uninsured. This could indicate that in some states, coming from a zip code with a high proportion of uninsured folks (or being uninsured yourself) means you are more likely to experience delays in your diagnosis.

Statistical Analysis

You can also make meaningful deductions by calculating various statistics from your data. For example, I want to calculate the skewness, or level of asymmetry, of each of my variables. A negative value indicates the data is left skewed when plotted, a positive value indicates the data is right skewed when plotted, and a value of 0 means the data is symmetrically distributed.
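As a quick, hypothetical sanity check on the sign convention (the sample vectors below are illustrative and not part of the competition data; skewness and exprnd require Statistics and Machine Learning Toolbox):
% Symmetric data has skewness near 0; right-skewed data has positive skewness.
rng(0)                        % for reproducibility
skewness(randn(1e5, 1))       % close to 0 for a symmetric normal sample
skewness(exprnd(1, 1e5, 1))   % close to 2 for a right-skewed exponential sample
With that intuition, let's compute the skewness of every numeric variable in each group: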
statsTrue = varfun(@skewness, allTrueData, "InputVariables", @isnumeric);
statsFalse = varfun(@skewness, allFalseData, "InputVariables", @isnumeric);
Now I want to see if any of the variables have a significant difference in their skewness, as differences in the data distributions between patients who were diagnosed in a timely manner vs patients who were not could indicate an underlying relationship between those variables and time to diagnosis.
statsDiffs = abs(statsTrue{:, :} - statsFalse{:, :});
statsTrue.Properties.VariableNames(statsDiffs > 0.2)
ans = 1×4 cell
{'skewness_density'} {'skewness_age_over_80'} {'skewness_rent_burden'} {'skewness_race_native'}
If we investigate the four variables that are returned, we can see that population density, the percentage of folks above 80 in your zip code, the median rent burden of your zip code, and the percentage of residents who reported their race as American Indian or Alaska Native in your zip code may have a relationship with time to diagnosis.

Feature Engineering

When it comes to machine learning, you don’t have to use all of the data as it is presented to you. Feature Engineering is the process of deciding what data you want to use, creating new data based on the provided data, and transforming the data to be in whatever format or range is suitable for your workflow. You can do this manually, and some of the exploration we just did should influence decisions you make if you want to play around with including or excluding different variables.
For this blog, I’ll use the gencfeatures function to automate this process. I’ll ask for 90 features, which is 10 more than our processed dataset currently has, and gencfeatures will create a set of 90 meaningful features from it. It may keep some data as-is, but it will often standardize numeric variables and create new variables by manipulating the provided data.
[T, augTrainData] = gencfeatures(fullData, "DiagPeriodL90D", 90)
Warning: Table variable names were truncated to the length namelengthmax.
T =

FeatureTransformer with properties:

Type: ‘classification’
TargetLearner: ‘linear’
NumEngineeredFeatures: 89
NumOriginalFeatures: 1
TotalNumFeatures: 90

augTrainData = 12157×91 table
metastatic_cancer_diagnosis_code zsc(woe2(breast_cancer_diagnosis_code)) zsc(woe2(breast_cancer_diagnosis_desc)) zsc(woe2(metastatic_cancer_diagnosis_code)) zsc(woe2(patient_state)) zsc(patient_age./Ozone) zsc(patient_age./commute_time) zsc(kmc51) eb28(education_less_highschool) zsc(income_household_35_to_50./income_household_75_to_100) zsc(kmc12) eb11(patient_age) q28(income_household_under_5) zsc(rent_burden-education_less_highschool) q11(patient_age) zsc(sig(family_dual_income)) zsc(sig(patient_age)) zsc(sin(PM25)) zsc(cos(rent_median)) zsc(sin(patient_zip3)) zsc(health_uninsured./PM25) zsc(cos(population)) zsc(cos(education_bachelors)) zsc(sin(hispanic)) q28(density) eb28(education_highschool) zsc(income_household_75_to_100.*rent_burden) q28(unemployment_rate) q28(patient_zip3) zsc(patient_id.*hispanic)
1 C7989 0.5390 0.5390 -0.4180 0.3566 0.2943 1.2850 -0.9221 28 0.4148 0.8408 11 17 -2.6485 11 0.0164 0.0329 0.4182 -1.3204 0.4400 0.1652 -1.3368 -0.8396 -0.9848 19 16 1.3688 25 25 1.8601
2 C773 0.5403 0.5403 0.6847 0.3566 -0.0953 -0.2533 1.1450 14 -1.0911 -0.6892 7 9 0.2093 7 0.0164 0.0329 0.5817 0.4439 -1.4209 -0.5852 -1.2593 0.1155 0.3359 24 8 0.6588 11 26 0.2666
3 C773 0.4472 0.4472 0.6847 -0.4817 -1.1723 -1.2683 -0.5463 9 -0.1595 0.1389 3 10 -0.1364 2 0.0164 0.0329 0.8441 -0.5097 -0.4515 1.2506 1.0719 0.8394 0.7034 14 14 0.0868 7 19 -0.6458
4 C773 0.4804 0.4804 0.6847 0.3566 -1.1790 -0.8614 2.4460 2 -1.4278 -2.6472 3 19 1.2015 2 0.0164 0.0329 0.5895 0.0984 0.9142 -0.9343 -1.3164 -0.5676 -1.2673 23 1 -0.8003 13 26 0.0137
5 C773 -1.8389 -1.8389 0.6847 -0.1188 -0.4668 -0.1359 -0.1677 10 1.1998 0.5623 5 11 -0.6774 5 0.0164 0.0329 -1.9107 -1.1374 0.3927 2.8054 -1.0779 0.0894 0.9926 5 16 -0.8990 5 21 0.0631
6 C7981 -1.8389 -1.8389 -3.0232 -0.7796 0.3970 0.6970 -0.5463 5 -0.4166 0.1389 8 1 -0.1713 9 0.0164 0.0329 -0.8052 -1.4920 0.4399 -0.6668 0.1846 -1.5704 0.8916 8 23 -0.1033 3 3 -0.8538
7 C779 0.5150 0.5150 -1.0339 0.3566 -0.7176 -0.8180 1.2575 26 0.1398 -0.9325 4 28 -0.8526 4 0.0164 0.0329 -2.1568 0.2754 1.3443 -0.4261 0.5523 0.3214 1.4403 27 6 -0.0186 26 23 2.7136
8 C773 0.5605 0.5605 0.6847 -0.8787 0.2552 0.4365 -0.9221 8 1.2589 0.8408 6 12 -1.0383 7 0.0164 0.0329 0.6462 0.3778 -0.2288 0.0129 0.3019 1.0857 0.5697 1 26 -1.3912 17 17 -0.7649
9 C773 0.5150 0.5150 0.6847 -0.8787 0.5303 0.7654 -0.5463 1 -0.3771 0.1389 7 15 0.0934 8 0.0164 0.0329 0.6494 1.3120 1.2738 -0.8822 1.1952 1.2474 0.1205 3 23 -0.1783 2 17 -0.8367
10 C799 0.5892 0.5892 -0.7083 0.4285 0.6062 0.4731 -0.1677 5 0.3661 0.5623 8 4 -0.1511 9 0.0164 0.0329 0.8468 0.7382 -1.3160 -0.3069 -0.5051 -1.0839 -1.0734 9 22 -0.5364 11 13 -0.5800
11 C7800 -1.8389 -1.8389 -0.4676 0.3566 0.2790 -0.0623 2.5856 26 0.1398 -3.1658 8 28 -0.8526 9 0.0164 0.0329 -2.1568 0.2754 1.3443 -0.4261 0.5523 0.3214 1.4403 27 6 -0.0186 26 23 0.4736
12 C781 0.5758 0.5758 -1.7502 0.4285 -0.0598 -0.2406 -0.5463 4 -0.4701 0.1389 6 8 0.1148 6 0.0164 0.0329 0.8290 -1.2051 0.8002 -0.9475 0.0945 1.1447 0.5680 16 14 -0.0036 16 12 -0.8148
13 C773 0.5668 0.5668 0.6847 0.3566 0.2459 -0.8014 1.2575 6 -1.3361 -0.9325 6 5 0.6748 6 0.0164 0.0329 -0.6779 0.9431 0.7497 -0.9270 0.9452 -0.4288 0.1632 17 4 -0.4941 13 27 0.7836
14 C786 0.5403 0.5403 -0.0208 0.4096 0.4423 0.4481 -1.0650 7 1.5972 0.0484 7 22 1.3062 7 0.0164 0.0329 -1.4776 0.6570 1.1964 0.8045 1.4144 0.4459 1.3659 4 15 0.5212 22 28 -0.4435
To better understand the generated features, you can use the describe function of the returned FeatureTransformer object, ‘T’.
describe(T)
Type IsOriginal InputVariables Transformations
___________ __________ _____________________________________________________ ________________________________________________________________________
metastatic_cancer_diagnosis_code Categorical true metastatic_cancer_diagnosis_code
zsc(woe2(breast_cancer_diagnosis_code)) Numeric false breast_cancer_diagnosis_code Weight of Evidence (positive class = 1)
Standardization with z-score (mean = -0.046637, std = 1.5098)
zsc(woe2(breast_cancer_diagnosis_desc)) Numeric false breast_cancer_diagnosis_desc Weight of Evidence (positive class = 1)
Standardization with z-score (mean = -0.046637, std = 1.5098)
zsc(woe2(metastatic_cancer_diagnosis_code)) Numeric false metastatic_cancer_diagnosis_code Weight of Evidence (positive class = 1)
Standardization with z-score (mean = 0.0067098, std = 0.28786)
zsc(woe2(patient_state)) Numeric false patient_state Weight of Evidence (positive class = 1)
Standardization with z-score (mean = 0.0060064, std = 0.23323)
zsc(patient_age./Ozone) Numeric false patient_age, Ozone patient_age ./ Ozone
Standardization with z-score (mean = 1.5005, std = 0.36544)
zsc(patient_age./commute_time) Numeric false patient_age, commute_time patient_age ./ commute_time
Standardization with z-score (mean = 2.1895, std = 0.64638)
zsc(kmc51) Numeric false all valid numeric variables Centroid encoding (component #51) (kmeans clustering with k = 10)
Standardization with z-score (mean = 5.9447, std = 0.1673)
eb28(education_less_highschool) Categorical false education_less_highschool Equal-width binning (number of bins = 28)
zsc(income_household_35_to_50./income_household_75_to_100) Numeric false income_household_35_to_50, income_household_75_to_100 income_household_35_to_50 ./ income_household_75_to_100
Standardization with z-score (mean = 0.93234, std = 0.2685)
zsc(kmc12) Numeric false all valid numeric variables Centroid encoding (component #12) (kmeans clustering with k = 10)
Standardization with z-score (mean = 13.4409, std = 0.15797)
eb11(patient_age) Categorical false patient_age Equal-width binning (number of bins = 11)
q28(income_household_under_5) Categorical false income_household_under_5 Equiprobable binning (number of bins = 28)
zsc(rent_burden-education_less_highschool) Numeric false rent_burden, education_less_highschool rent_burden – education_less_highschool
Standardization with z-score (mean = 19.3265, std = 5.7168)
q11(patient_age) Categorical false patient_age Equiprobable binning (number of bins = 11)
zsc(sig(family_dual_income)) Numeric false family_dual_income sigmoid( )
Standardization with z-score (mean = 1, std = 4.2283e-11)
zsc(sig(patient_age)) Numeric false patient_age sigmoid( )
Standardization with z-score (mean = 1, std = 4.0863e-10)
zsc(sin(PM25)) Numeric false PM25 sin( )
Standardization with z-score (mean = 0.42558, std = 0.65419)
zsc(cos(rent_median)) Numeric false rent_median cos( )
Standardization with z-score (mean = 0.046444, std = 0.68827)
zsc(sin(patient_zip3)) Numeric false patient_zip3 sin( )
Standardization with z-score (mean = 0.054487, std = 0.70171)
zsc(health_uninsured./PM25) Numeric false health_uninsured, PM25 health_uninsured ./ PM25
Standardization with z-score (mean = 1.1917, std = 0.6234)
zsc(cos(population)) Numeric false population cos( )
Standardization with z-score (mean = -0.03209, std = 0.71354)
zsc(cos(education_bachelors)) Numeric false education_bachelors cos( )
Standardization with z-score (mean = 0.096871, std = 0.68966)
zsc(sin(hispanic)) Numeric false hispanic sin( )
Standardization with z-score (mean = 0.017785, std = 0.6817)
q28(density) Categorical false density Equiprobable binning (number of bins = 28)
eb28(education_highschool) Categorical false education_highschool Equal-width binning (number of bins = 28)
zsc(income_household_75_to_100.*rent_burden) Numeric false income_household_75_to_100, rent_burden income_household_75_to_100 .* rent_burden
Standardization with z-score (mean = 392.7502, std = 61.6458)
q28(unemployment_rate) Categorical false unemployment_rate Equiprobable binning (number of bins = 28)
q28(patient_zip3) Categorical false patient_zip3 Equiprobable binning (number of bins = 28)
zsc(patient_id.*hispanic) Numeric false patient_id, hispanic patient_id .* hispanic
Standardization with z-score (mean = 10169065.2502, std = 11587944.1233)
zsc(home_value.*race_other) Numeric false home_value, race_other home_value .* race_other
Standardization with z-score (mean = 2725364.3718, std = 4298818.8992)
zsc(patient_age.*income_household_20_to_25) Numeric false patient_age, income_household_20_to_25 patient_age .* income_household_20_to_25
Standardization with z-score (mean = 241.7171, std = 97.8001)
q25(farmer) Categorical false farmer Equiprobable binning (number of bins = 25)
q27(race_native) Categorical false race_native Equiprobable binning (number of bins = 27)
eb28(age_median) Categorical false age_median Equal-width binning (number of bins = 28)
q28(never_married) Categorical false never_married Equiprobable binning (number of bins = 28)
zsc(cos(patient_age)) Numeric false patient_age cos( )
Standardization with z-score (mean = 0.021113, std = 0.71469)
zsc(sin(race_black)) Numeric false race_black sin( )
Standardization with z-score (mean = 0.16517, std = 0.70668)
zsc(tanh(age_50s)) Numeric false age_50s tanh( )
Standardization with z-score (mean = 1, std = 8.9224e-09)
zsc(male+female) Numeric false male, female male + female
Standardization with z-score (mean = 100.0001, std = 0.000436)
q28(female) Categorical false female Equiprobable binning (number of bins = 28)
eb28(male) Categorical false male Equal-width binning (number of bins = 28)
zsc(sin(age_median)) Numeric false age_median sin( )
Standardization with z-score (mean = -0.1365, std = 0.71613)
q28(home_ownership) Categorical false home_ownership Equiprobable binning (number of bins = 28)
zsc(age_over_80./income_household_20_to_25) Numeric false age_over_80, income_household_20_to_25 age_over_80 ./ income_household_20_to_25
Standardization with z-score (mean = 1.0866, std = 0.51568)
zsc(cos(education_highschool)) Numeric false education_highschool cos( )
Standardization with z-score (mean = -0.019221, std = 0.71994)
zsc(cos(race_black)) Numeric false race_black cos( )
Standardization with z-score (mean = -0.020693, std = 0.68773)
q28(self_employed) Categorical false self_employed Equiprobable binning (number of bins = 28)
zsc(cos(age_median)) Numeric false age_median cos( )
Standardization with z-score (mean = -0.029038, std = 0.68394)
q50(patient_id) Categorical false patient_id Equiprobable binning (number of bins = 50)
zsc(sin(race_asian)) Numeric false race_asian sin( )
Standardization with z-score (mean = 0.28421, std = 0.64235)
q28(education_stem_degree) Categorical false education_stem_degree Equiprobable binning (number of bins = 28)
zsc(cos(age_20s)) Numeric false age_20s cos( )
Standardization with z-score (mean = 0.10518, std = 0.69162)
eb23(N02) Categorical false N02 Equal-width binning (number of bins = 23)
q28(rent_burden) Categorical false rent_burden Equiprobable binning (number of bins = 28)
zsc(race_asian.*veteran) Numeric false race_asian, veteran race_asian .* veteran
Standardization with z-score (mean = 28.4889, std = 30.7)
zsc(sin(income_household_35_to_50)) Numeric false income_household_35_to_50 sin( )
Standardization with z-score (mean = 0.03083, std = 0.68752)
zsc(cos(patient_zip3)) Numeric false patient_zip3 cos( )
Standardization with z-score (mean = -0.06867, std = 0.7071)
eb28(rent_burden) Categorical false rent_burden Equal-width binning (number of bins = 28)
zsc(sig(rent_burden)) Numeric false rent_burden sigmoid( )
Standardization with z-score (mean = 1, std = 3.571e-10)
q28(age_over_80) Categorical false age_over_80 Equiprobable binning (number of bins = 28)
q28(family_dual_income) Categorical false family_dual_income Equiprobable binning (number of bins = 28)
q28(family_size) Categorical false family_size Equiprobable binning (number of bins = 28)
zsc(age_over_80./income_household_5_to_10) Numeric false age_over_80, income_household_5_to_10 age_over_80 ./ income_household_5_to_10
Standardization with z-score (mean = 2.0422, std = 1.3415)
eb28(age_10_to_19) Categorical false age_10_to_19 Equal-width binning (number of bins = 28)
q28(income_individual_median) Categorical false income_individual_median Equiprobable binning (number of bins = 28)
zsc(age_over_80./unemployment_rate) Numeric false age_over_80, unemployment_rate age_over_80 ./ unemployment_rate
Standardization with z-score (mean = 0.74942, std = 0.37691)
zsc(cos(income_household_50_to_75)) Numeric false income_household_50_to_75 cos( )
Standardization with z-score (mean = -0.012865, std = 0.69717)
eb25(race_pacific) Categorical false race_pacific Equal-width binning (number of bins = 25)
zsc(sin(patient_id)) Numeric false patient_id sin( )
Standardization with z-score (mean = -0.0018454, std = 0.70739)
zsc(race_native./race_multiple) Numeric false race_native, race_multiple race_native ./ race_multiple
Standardization with z-score (mean = 0.14079, std = 0.41944)
eb28(income_household_25_to_35) Categorical false income_household_25_to_35 Equal-width binning (number of bins = 28)
zsc(age_50s-income_household_75_to_100) Numeric false age_50s, income_household_75_to_100 age_50s – income_household_75_to_100
Standardization with z-score (mean = 0.77657, std = 2.1264)
zsc(cos(age_60s)) Numeric false age_60s cos( )
Standardization with z-score (mean = 0.05337, std = 0.75178)
q28(income_household_35_to_50) Categorical false income_household_35_to_50 Equiprobable binning (number of bins = 28)
eb21(race_black) Categorical false race_black Equal-width binning (number of bins = 21)
zsc(sin(income_individual_median)) Numeric false income_individual_median sin( )
Standardization with z-score (mean = 0.045145, std = 0.69873)
q28(age_50s) Categorical false age_50s Equiprobable binning (number of bins = 28)
q28(race_white) Categorical false race_white Equiprobable binning (number of bins = 28)
q28(age_under_10) Categorical false age_under_10 Equiprobable binning (number of bins = 28)
q28(disabled) Categorical false disabled Equiprobable binning (number of bins = 28)
zsc(patient_age./income_household_100_to_150) Numeric false patient_age, income_household_100_to_150 patient_age ./ income_household_100_to_150
Standardization with z-score (mean = 3.9266, std = 1.314)
q28(income_household_75_to_100) Categorical false income_household_75_to_100 Equiprobable binning (number of bins = 28)
zsc(sin(N02)) Numeric false N02 sin( )
Standardization with z-score (mean = 0.039533, std = 0.70149)
eb28(family_size) Categorical false family_size Equal-width binning (number of bins = 28)
q28(limited_english) Categorical false limited_english Equiprobable binning (number of bins = 28)
q28(income_household_100_to_150) Categorical false income_household_100_to_150 Equiprobable binning (number of bins = 28)
zsc(farmer.*race_black) Numeric false farmer, race_black farmer .* race_black
Standardization with z-score (mean = 10.7649, std = 26.8957)
zsc(home_value.*race_pacific) Numeric false home_value, race_pacific home_value .* race_pacific
Standardization with z-score (mean = 59826.8413, std = 128896.4218)
zsc(education_graduate.*health_uninsured) Numeric false education_graduate, health_uninsured education_graduate .* health_uninsured
Standardization with z-score (mean = 97.7642, std = 54.0304)

Split the Data

The last step before you can train a machine learning model is to split your data into a training and testing set. We’ll use the training data to fit the model, and the testing set to evaluate how well the model performs on new data before we use it to make a submission. Here I split the data into 80% training and 20% testing.
numRows = height(augTrainData);
[trainInd, ~, testInd] = dividerand(numRows, .8, 0, .2);
trainingData = augTrainData(trainInd, :);
testingData = augTrainData(testInd, :);
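Note that dividerand is part of Deep Learning Toolbox. If you don't have it, a stratified holdout split with cvpartition from Statistics and Machine Learning Toolbox is one possible alternative (a sketch, not the split used for the results below):
% Stratified 80/20 split that preserves the class balance of DiagPeriodL90D
cvp = cvpartition(augTrainData.DiagPeriodL90D, "HoldOut", 0.2);
trainingData = augTrainData(training(cvp), :);
testingData = augTrainData(test(cvp), :);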

Train a Machine Learning Model

In this example, I’ll create a binary decision tree using the fitctree function and set ‘Optimize Hyperparameters’ to ‘auto’, which will attempt to minimize the error of our algorithm by choosing the best value for the ‘MinLeafSize’ parameter. It visualizes the results of adjusting this value, as can be seen below.
classificationTree = fitctree(trainingData, "DiagPeriodL90D", ...
    OptimizeHyperparameters="auto");
|======================================================================================|
| Iter | Eval | Objective | Objective | BestSoFar | BestSoFar | MinLeafSize |
| | result | | runtime | (observed) | (estim.) | |
|======================================================================================|
| 1 | Best | 0.18764 | 1.4699 | 0.18764 | 0.18764 | 1676 |
| 2 | Accept | 0.18764 | 0.87349 | 0.18764 | 0.18764 | 162 |
| 3 | Accept | 0.20923 | 1.005 | 0.18764 | 0.19426 | 36 |
| 4 | Accept | 0.29395 | 1.6132 | 0.18764 | 0.18764 | 3 |
| 5 | Accept | 0.18764 | 0.6073 | 0.18764 | 0.1876 | 491 |
| 6 | Accept | 0.38012 | 0.21492 | 0.18764 | 0.24104 | 4858 |
| 7 | Accept | 0.18764 | 0.60759 | 0.18764 | 0.18764 | 330 |
| 8 | Accept | 0.18764 | 0.36986 | 0.18764 | 0.18763 | 1033 |
| 9 | Accept | 0.19227 | 1.0609 | 0.18764 | 0.18762 | 80 |
| 10 | Accept | 0.24409 | 1.4868 | 0.18764 | 0.18761 | 13 |
| 11 | Accept | 0.18764 | 0.3479 | 0.18764 | 0.18568 | 1363 |
| 12 | Accept | 0.18764 | 0.70426 | 0.18764 | 0.1861 | 231 |
| 13 | Accept | 0.18764 | 0.48941 | 0.18764 | 0.18678 | 698 |
| 14 | Accept | 0.29519 | 2.1238 | 0.18764 | 0.18671 | 1 |
| 15 | Accept | 0.18764 | 0.35153 | 0.18764 | 0.18736 | 1438 |
| 16 | Accept | 0.18764 | 0.86203 | 0.18764 | 0.18735 | 119 |
| 17 | Accept | 0.18764 | 0.41595 | 0.18764 | 0.18734 | 849 |
| 18 | Accept | 0.18764 | 0.31486 | 0.18764 | 0.18737 | 1527 |
| 19 | Accept | 0.18764 | 0.60161 | 0.18764 | 0.18738 | 404 |
| 20 | Accept | 0.18764 | 0.45615 | 0.18764 | 0.18738 | 589 |
|======================================================================================|
| Iter | Eval | Objective | Objective | BestSoFar | BestSoFar | MinLeafSize |
| | result | | runtime | (observed) | (estim.) | |
|======================================================================================|
| 21 | Accept | 0.18764 | 0.30864 | 0.18764 | 0.18745 | 1515 |
| 22 | Accept | 0.18764 | 0.71981 | 0.18764 | 0.18745 | 138 |
| 23 | Accept | 0.18764 | 0.62974 | 0.18764 | 0.18745 | 278 |
| 24 | Accept | 0.18764 | 0.27013 | 0.18764 | 0.18749 | 1511 |
| 25 | Accept | 0.18764 | 0.62894 | 0.18764 | 0.18749 | 196 |
| 26 | Accept | 0.18764 | 0.40254 | 0.18764 | 0.18749 | 811 |
| 27 | Accept | 0.30239 | 0.19617 | 0.18764 | 0.18741 | 2944 |
| 28 | Accept | 0.18764 | 0.27176 | 0.18764 | 0.18741 | 1170 |
| 29 | Accept | 0.18764 | 0.37273 | 0.18764 | 0.18747 | 1576 |
| 30 | Accept | 0.18764 | 0.45381 | 0.18764 | 0.18747 | 945 |
|======================================================================================|
Optimization completed.
MaxObjectiveEvaluations of 30 reached.
Total function evaluations: 30
Total elapsed time: 50.9097 seconds
Total objective function evaluation time: 20.2308

Best observed feasible point:
MinLeafSize
___________
1676

Observed objective function value = 0.18764
Estimated objective function value = 0.18815
Function evaluation time = 1.4699

Best estimated feasible point (according to models):
MinLeafSize
___________
1527

Estimated objective function value = 0.18747
Estimated function evaluation time = 0.36237
I used a binary tree as my starting point, but it’s important to test out different types of algorithms to see what works best with your data! Check out the Classification Learner app documentation and this short video to learn how to train several machine learning models quickly and iteratively!
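For instance, one alternative you could try from the command line (an illustrative sketch, not part of the original workflow) is a bagged ensemble of decision trees:
% Train a bagged tree ensemble on the same training data as an alternative model
ensembleModel = fitcensemble(trainingData, "DiagPeriodL90D", Method="Bag");
% Compare its test accuracy against the single tree's
ensembleAccuracy = 1 - loss(ensembleModel, testingData, "DiagPeriodL90D", LossFun="classiferror")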

Test Your Model

There are many ways to evaluate the performance of a machine learning model, so in this blog I’ll show how to do so by computing validation accuracy and using testing data.

Validation Accuracy

Cross-validation is one method of evaluating a model, and at a high level is done by:
  1. Setting aside a subset of the training data, known as validation data
  2. Using the rest of the training data to fit the model
  3. Testing how well the model performs on the validation data
You can use the crossval function to do this:
% Perform cross-validation
partitionedModel = crossval(classificationTree, "KFold", 5);
Then, extract the misclassification rate, and subtract it from 1 to get the model’s accuracy. The closer to 1 this value is, the more accurate our model is.
% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, LossFun="classiferror")
validationAccuracy = 0.8124

Testing Data

In this section, we’ll use the ‘testingData’ dataset we created earlier. Similar to what we did with the validation data, we can use the loss function to compute the misclassification rate of the classification tree on the testing data, and subtract it from 1 to get a measure of accuracy.
testAccuracy = 1 - loss(classificationTree, testingData, "DiagPeriodL90D", ...
    LossFun="classiferror")
testAccuracy = 0.8048
I also want to compare the predictions that the model makes to the actual outputs, so let’s remove the ‘DiagPeriodL90D’ variable from our testing data:
testActual = testingData.DiagPeriodL90D;
testingData = removevars(testingData, "DiagPeriodL90D");
Now, use the model to make predictions on the testing set:
[testPreds, scores, ~, ~] = predict(classificationTree, testingData);
And use the confusionchart function to compare the predicted outputs to the actual outputs, to see how often they match or don’t.
confusionchart(testActual, testPreds)
This shows that the model almost always predicts the 1s correctly, that is, when the patient is diagnosed within 90 days, but it’s almost a 50/50 chance that it will predict the 0s correctly.
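To put a number on that observation, you can compute per-class recall from the confusion matrix. This is an illustrative addition to the workflow (confusionmat ships with Statistics and Machine Learning Toolbox):
% Rows of the confusion matrix correspond to the true classes (0, then 1);
% the diagonal holds the counts of correct predictions.
cm = confusionmat(testActual, testPreds);
perClassRecall = diag(cm) ./ sum(cm, 2)   % fraction of each true class predicted correctly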
We can also use the test data and predictions to visualize receiver operating characteristic (ROC) metrics. The ROC curve shows the true positive rate (TPR) versus the false positive rate (FPR) for different thresholds of classification scores. The “Model Operating Point” shows the false positive rate and true positive rate of the model.
rocObj = rocmetrics(testActual, scores, classificationTree.ClassNames);
plot(rocObj)
Here we can see that the classifier correctly assigns about 90-95% of the 1 class observations to 1 (TPR), but incorrectly assigns about 40% of the 0 class observations as 1 (FPR). This is similar to what we observed with the confusion chart.
You can also extract the area under the curve (AUC) value, which is a measure of the overall quality of the classifier. The AUC values are in the range 0 to 1, and larger AUC values indicate better classifier performance.
rocObj.AUC
The AUC is pretty high, but shows that there is definitely room for improvement. To learn more about ROC metrics, check out this documentation page that explains it in more detail.

Create Submission

Once you have a model that performs well on the validation and testing data, it’s time to create a submission for the datathon! As a reminder, you will upload this file to Kaggle to be scored on the leaderboard.
First, import the ‘Test’ dataset:
testDataFilename = "Test.csv";
allTestData = readtable(fullfile(dataFolder, testDataFilename))
allTestData = 3999×83 table
patient_id patient_race payer_type patient_state patient_zip3 patient_age patient_gender bmi breast_cancer_diagnosis_code breast_cancer_diagnosis_desc metastatic_cancer_diagnosis_code metastatic_first_novel_treatment metastatic_first_novel_treatment_type Region Division population density age_median age_under_10 age_10_to_19 age_20s age_30s age_40s age_50s age_60s age_70s age_over_80 male female married
1 573710 ‘White’ ‘MEDICAID’ ‘IN’ 467 54 ‘F’ NaN ‘C50412’ ‘Malig neoplasm of upper-outer quadrant of left female breast’ ‘C773’ NaN NaN ‘Midwest’ ‘East North Central’ 5.4414e+03 85.6210 40.8803 12.7323 14.0887 10.6597 11.6258 11.2081 15.6194 12.3226 8.4097 3.3435 49.1548 50.8452 55.1758
2 593679 ‘COMMERCIAL’ ‘FL’ 337 52 ‘F’ NaN ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C787’ NaN NaN ‘South’ ‘South Atlantic’ 1.9614e+04 1.5551e+03 49.1077 8.0692 8.5872 10.6846 11.3026 10.9718 15.8231 15.9026 11.8282 6.8154 49.6590 50.3410 44.8000
3 184532 ‘Hispanic’ ‘MEDICAID’ ‘CA’ 917 61 ‘F’ NaN ‘C50911’ ‘Malignant neoplasm of unsp site of right female breast’ ‘C773’ NaN NaN ‘West’ ‘Pacific’ 4.3030e+04 2.0486e+03 38.8522 11.3065 12.8978 14.1217 13.5326 13.1609 13.3783 11.4739 6.3804 3.7370 49.0522 50.9478 48.5043
4 184532 ‘Hispanic’ ‘MEDICARE ADVANTAGE’ ‘CA’ 917 61 ‘F’ NaN ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C779’ NaN NaN ‘West’ ‘Pacific’ 4.3030e+04 2.0486e+03 38.8522 11.3065 12.8978 14.1217 13.5326 13.1609 13.3783 11.4739 6.3804 3.7370 49.0522 50.9478 48.5043
5 447383 ‘Black’ ‘CA’ 917 64 ‘F’ 23 ‘C50412’ ‘Malig neoplasm of upper-outer quadrant of left female breast’ ‘C779’ NaN NaN ‘West’ ‘Pacific’ 3.6054e+04 5.2943e+03 36.6538 9.7615 11.2677 17.2338 17.4415 13.0908 12.3046 9.4077 5.6738 3.8246 50.5108 49.4892 33.4785
6 281312 ‘COMMERCIAL’ ‘MI’ 483 64 ‘F’ 24 ‘1748’ ‘Malignant neoplasm of other specified sites of female breast’ ‘C7800’ NaN NaN ‘Midwest’ ‘East North Central’ 2.0151e+04 724.9353 42.0784 11.0392 13.0098 11.6431 11.8882 13.0647 15.1098 12.8686 7.4000 3.9588 49.2922 50.7078 54.0137
7 492714 ‘COMMERCIAL’ ‘TX’ 761 91 ‘F’ NaN ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C773’ NaN NaN ‘South’ ‘West South Central’ 2.9482e+04 1.3355e+03 33.6278 13.1611 15.3444 16.7250 15.2167 12.5361 11.4139 8.8583 4.4167 2.3250 47.6694 52.3306 43.4639
8 378266 ‘White’ ‘MEDICARE ADVANTAGE’ ‘IN’ 473 79 ‘F’ NaN ‘C50212’ ‘Malig neoplasm of upper-inner quadrant of left female breast’ ‘C773’ NaN NaN ‘Midwest’ ‘East North Central’ 5.2774e+03 296.8542 42.0763 11.0220 13.9932 12.1288 10.4949 12.3237 13.4797 13.8864 7.6407 5.0271 48.5627 51.4373 50.6559
9 291550 ‘COMMERCIAL’ ‘AZ’ 852 50 ‘F’ NaN ‘C50919’ ‘Malignant neoplasm of unsp site of unspecified female breast’ ‘C773’ NaN NaN ‘West’ ‘Mountain’ 3.5899e+04 1.1664e+03 41.8273 10.8364 12.3045 12.7114 12.7545 11.8909 13.0341 12.7659 9.1523 4.5614 49.7568 50.2432 51.5750
10 612272 ‘COMMERCIAL’ ‘CA’ 902 47 ‘F’ 24 ‘C50412’ ‘Malig neoplasm of upper-outer quadrant of left female breast’ ‘C7801’ NaN NaN ‘West’ ‘Pacific’ 3.5350e+04 3.5588e+03 38.7486 11.0686 13.8657 13.6371 13.7886 13.7000 13.0686 10.7143 6.4571 3.6943 49.2714 50.7286 45.4657
11 240105 ‘White’ ‘MEDICAID’ ‘CO’ 802 56 ‘F’ NaN ‘C50919’ ‘Malignant neoplasm of unsp site of unspecified female breast’ ‘C7931’ NaN NaN ‘West’ ‘Mountain’ 2.5754e+04 2.4639e+03 35.8175 10.1600 10.0575 18.0375 19.6900 13.9675 10.9375 8.9850 5.3875 2.7900 50.2375 49.7625 41.6875
12 277939 ‘White’ ‘MEDICAID’ ‘KY’ 401 44 ‘F’ NaN ‘1749’ ‘Malignant neoplasm of breast (female), unspecified’ ‘C7931’ NaN NaN ‘South’ ‘East South Central’ 4.9004e+03 64.2871 42.1097 11.2903 11.8742 12.5065 11.3323 11.7258 15.5484 13.4419 8.6645 3.6194 51.4742 48.5258 47.8613
13 504153 ‘COMMERCIAL’ ‘IL’ 600 52 ‘F’ NaN ‘1749’ ‘Malignant neoplasm of breast (female), unspecified’ ‘C7931’ NaN NaN ‘Midwest’ ‘East North Central’ 2.5744e+04 981.7631 41.7625 11.7846 13.8677 10.5738 11.3246 12.5923 15.0154 13.0277 7.8185 3.9908 49.9282 50.0708 57.2108
14 287269 ‘Asian’ ‘COMMERCIAL’ ‘IL’ 606 58 ‘F’ 23 ‘C50912’ ‘Malignant neoplasm of unspecified site of left female breast’ ‘C773’ NaN NaN ‘Midwest’ ‘East North Central’ 4.8671e+04 6.4314e+03 35.7554 10.4286 10.6518 18.3107 18.9036 11.9696 11.7268 9.6839 5.4071 2.8911 48.6964 51.3036 35.9304
Then we need to process this dataset in the same way that we did the training data. In this section, I use code instead of the live tasks for simplicity.
% replace cell arrays with categoricals
varTypes = varfun(@class, allTestData, OutputFormat="cell");
catIdx = strcmp(varTypes, "cell");
varNames = allTestData.Properties.VariableNames;
catVarNames = varNames(catIdx);
for catNameIdx = 1:length(catVarNames)
    allTestData.(catVarNames{catNameIdx}) = categorical(allTestData.(catVarNames{catNameIdx}));
end
% remove variables with too many missing data points
fullTestData = removevars(allTestData, ["bmi", "metastatic_first_novel_treatment", "metastatic_first_novel_treatment_type"]);
We also need to use the transform function to create the same features as we created using gencfeatures for the training data.
augTestData = transform(T, fullTestData);
Now that the data is in the format our machine learning model expects, use the predict function to make predictions, and create a table containing the patient IDs and corresponding predictions.
submissionPreds = predict(classificationTree, augTestData);
submissionTable = table(fullTestData.patient_id, submissionPreds, VariableNames=["patient_id", "DiagPeriodL90D"])
submissionTable = 3780×2 table
patient_id DiagPeriodL90D
1 573710 1
2 593679 1
3 184532 1
4 447383 1
5 687972 1
6 281312 0
7 492714 1
8 378266 1
9 291550 1
10 612272 1
11 240105 1
12 277939 0
13 504153 0
14 287269 1
Last, export your predictions to a .csv file, then upload it to Kaggle for scoring.
writetable(submissionTable, "Predictions.csv");
And that’s it! Thank you for following along with this tutorial, and best of luck to all participants. If you have any questions about this tutorial or MATLAB, reach out to us at studentcompetitions@mathworks.com or by tagging gracewoolson in the forum. Keep an eye out for our upcoming WiDS Workshop on January 31st, where we will walk through this tutorial and answer any questions you have along the way!

]]>
https://blogs.mathworks.com/student-lounge/2024/01/01/predicting-timely-diagnosis-of-metastatic-breast-cancer-for-the-wids-datathon-2024/feed/ 0
High School Students Tackle Mobility Challenges with an Award-winning Innovative Engineering Solution https://blogs.mathworks.com/student-lounge/2023/12/19/high-school-students-tackle-mobility-challenges-with-an-award-winning-innovative-engineering-solution/?s_tid=feedtopost https://blogs.mathworks.com/student-lounge/2023/12/19/high-school-students-tackle-mobility-challenges-with-an-award-winning-innovative-engineering-solution/#respond Tue, 19 Dec 2023 08:58:09 +0000 https://blogs.mathworks.com/student-lounge/?p=10672

Welcome to our blog post! Today, we have the pleasure of introducing our guest blogger who will be sharing an exciting project called the HANDIWHEEL. This project was developed by a team of talented... read more >>

]]>
Welcome to our blog post! Today, we have the pleasure of introducing our guest blogger who will be sharing an exciting project called the HANDIWHEEL. This project was developed by a team of talented students from Louis Armand High School in Nogent-sur-Marne, France. Let’s dive into their remarkable journey and explore the innovative aspects of their award-winning creation.

What is the HANDIWHEEL project?

The HANDIWHEEL project is a remarkable endeavor that focuses on engineering and mobility. The team aimed to design a scaled-down prototype of a wheelchair capable of overcoming obstacles, particularly curbs several centimeters in height. This project not only showcases the students’ technical skills but also their dedication to making a positive impact on the lives of individuals with mobility challenges (as an illustration, refer to the article in the French News). By developing this innovative solution, the team strives to enhance accessibility and empower those with mobility limitations to navigate their surroundings with greater ease and independence.

How did the team develop the idea of the deployable wheel and how did they bring it to reality?

The team recognized the importance of overcoming sidewalks and curbs for wheelchair users. Inspired by this challenge, they conceptualized the idea of deployable front wheels. By utilizing a system of deployable wheels controlled by a servo motor, they devised a kinematic mechanism with 10 pivot joints. This ingenious solution allowed the front wheels to increase their diameters, enabling them to conquer obstacles effectively.

What tools did the team use to go from concept to design?

To transform their concept into a tangible design, the team employed various tools and methodologies. They utilized CAD models and prototypes created at the school’s FabLab, using 3D printing for the plastic parts and laser cutting for the wooden side walls. Additionally, they adopted the Model-Based Design (MBD) approach, leveraging Simulink, Stateflow, and Simscape to develop models for motor control, sensor information management, and robot motion control, and for sizing the robot’s energy autonomy.
Deployable wheel: CAD model and prototype done at the school FabLab
Simscape model

What was the team’s experience going from simulation to creating a prototype?

Transitioning from simulation to prototype involved significant effort and problem-solving. The students successfully connected their Simulink models to an Arduino board, allowing them to deploy the control program for motorization and sensor information management. They also utilized Stateflow to create a state diagram, enabling precise sequencing of the robot’s movements for obstacle climbing. Moreover, they addressed mechanical constraints by manufacturing the wheels using 3D printing and laser cutting techniques. Overcoming challenges such as assembly and rotational guidance, they found innovative solutions, including using bearings from the company IGUS.
Electrical constraints required special attention to ensure proper connections and functionality between the battery, Arduino Mega board, servo motors, and motor drivers. The students learned soldering techniques and various types of electrical wiring to address these constraints.
The students successfully adapted the robot’s control program to incorporate the new mechanical evolutions and developments, aligning it with the project specifications. This required understanding Stateflow charts and making the necessary adjustments to ensure smooth operation of the robot based on its updated mechanical design.
State diagrams developed with Stateflow

Conclusion

In conclusion, the HANDIWHEEL project embarked on an extraordinary human adventure. Under the guidance of their teachers Pierre Rabec and Mehdi Boughriet, the team consisted of highly motivated students: Luca Fontaine, Walid Atik, Matéo Guglielmi-Lacoux, Edern Deneuville, and Illya Liganov. These students have a passion for science and enjoy tackling challenges. Through their exceptional collective intelligence and unwavering teamwork, they brought the deployable wheel concept to life. From CAD modeling to simulation and prototype creation, their journey exemplifies the power of innovation and determination. We applaud their remarkable achievements and look forward to witnessing the positive impact of the HANDIWHEEL project on individuals with mobility challenges.
HANDIWHEEL Team

]]>
https://blogs.mathworks.com/student-lounge/2023/12/19/high-school-students-tackle-mobility-challenges-with-an-award-winning-innovative-engineering-solution/feed/ 0
Drowsy – Sleep Analytics and Improvement through Technology https://blogs.mathworks.com/student-lounge/2023/12/11/drowsy-sleep-analytics-and-improvement-through-technology/?s_tid=feedtopost https://blogs.mathworks.com/student-lounge/2023/12/11/drowsy-sleep-analytics-and-improvement-through-technology/#respond Mon, 11 Dec 2023 08:31:12 +0000 https://blogs.mathworks.com/student-lounge/?p=10630

Today we are joined by Jonathan Wang, Andrew Fu, Eric Liu, and Suparn Sathya who won the “Best Use of MATLAB” award at HackDavis 2023. Their app tries to make maintaining a healthy sleeping schedule... read more >>

]]>
Today we are joined by Jonathan Wang, Andrew Fu, Eric Liu, and Suparn Sathya who won the “Best Use of MATLAB” award at HackDavis 2023. Their app tries to make maintaining a healthy sleeping schedule fun and intuitive by tracking and showing comprehensive sleep data. Over to the team to explain more…
From left to right: Jonathan Wang, Andrew Fu, Eric Liu, Suparn Sathya

Inspiration

We were inspired by a common occurrence among college students: sleep deprivation. Staying up late to squeeze in some studying is super common, especially around exam season, but sacrificing sleep for extra study hours is counterproductive. Lack of sleep affects memory, concentration, and creativity, and impacts a student’s academic success and mental health.
To combat this, we thought of creating a sleep tracking game. Our goal was to encourage students to sleep more with in-game rewards. Additionally, we intended to design the app to offer detailed insights into the quality of their nightly sleep and provide advice on enhancing sleep quality for the future.

Breaking down the problem

After we identified sleep deprivation as a common issue that our project could be centered on, we looked at existing products, each with their own pros and cons, to understand what we needed to develop an effective solution. Based on our research, we found that most existing solutions do very little to ensure consistent, quality sleep hours; instead, they reward users simply for waking up daily, which doesn’t tackle the core issue of lack of sleep. Another issue was that users were given little incentive to maintain healthy sleep schedules. To resolve both of these limitations, we decided that the best way to deliver a solution would be a webapp with access to user sleep data. On the webapp, users would be able to view an audio analysis of their sleep, which may be an indicator of overall quality. Based on the times users go to sleep and wake up, the webapp would “game-ify” their sleep by rewarding them with virtual points that can be redeemed through an online store.

How did we implement it?

Very early into the hackathon, we decided to integrate MATLAB into our project because of its powerful signal processing capabilities. We had no solid pointers on how to get started with designing a sleep analysis algorithm, so the majority of our time was spent on prototyping and testing our data against MATLAB’s various filters and feature extraction functions. The final version of our code is a sleep staging algorithm that takes in an audio recording of someone’s breathing for the duration of their sleep.
The main challenge was identifying every instance of a breath. Our solution performs peak analysis on the audio data. Keep in mind that most of our technical design choices were driven by simplicity or efficiency; the code below demonstrates some of these aspects.
% Reads audio data. The default sampling rate is 44100 Hz.
[source, samplingRate] = audioread("Sleep_2023-5-21.wav");
% Reduces the sampling rate to 60 Hz and takes absolute amplitudes.
source = abs(source(1 : samplingRate / 60 : end));
% Truncates the total duration to the nearest whole minute (3600 samples at 60 Hz).
overflow = mod(length(source), 3600);
source = source(1 : end - overflow);
% Evenly splits the audio data into 60-second columns.
source = reshape(source, 3600, []);
We split the audio data into minute-by-minute chunks instead of analyzing the entire signal at once, mainly because the performance of certain functions such as envelope degrades sharply with increasing input size. Partitioning the data by minute also lets us discretize the computation and analysis of respiratory rate. To further speed up the program, we resampled the data at 60 Hz instead of the default 44100 Hz, which reduces the total number of data points by a factor of 735.
% Removes outliers from raw data.
source = hampel(source);
% Creates peak envelopes with peak separations of 60 samples.
source = envelope(source, 60, "peak");
In an audio signal, the inhalation and exhalation process is represented by a dense pocket of noise. Knowing this, we settled on running our audio data through the Hampel filter, which sparingly removes outliers, generally targeting only data points outside the noise pockets. After applying the filter, we fit a peak envelope onto the raw signal to extract the shape of the noise pockets; the envelope’s peak separation of 60 samples implicitly assumes that respiratory rate doesn’t exceed 60 breaths per minute (a normal respiratory rate while sleeping is 12-20 breaths per minute). Performing these transformations helps minimize the impact of noise and prepares the data for peak analysis.
% Iterates over every minute of audio data, counting the peaks in the enveloped data for each minute.
numCols = size(source, 2);
respiratoryRate = zeros(1, numCols);
for i = 1 : numCols
    pks = findpeaks(source(:, i));
    respiratoryRate(i) = length(pks);
end
% Identifies the 15 most abrupt changes in respiratory rate over the full course
% of sleep, requiring each change to be at least 15 minutes apart.
ipt = findchangepts(respiratoryRate, MaxNumChanges = 15, MinDistance = 15);
rrLength = length(respiratoryRate);
ipt = [1, ipt, rrLength];
values = zeros(1, rrLength);
for i = 1 : length(ipt) - 1
    values(ipt(i) : ipt(i + 1)) = mean(respiratoryRate(ipt(i) : ipt(i + 1)));
end
The number of peaks in the signal each minute equals the number of breaths taken that minute, also known as the respiratory rate. Respiratory rate is expected to vary minimally, which is why we average these values in steps. The findchangepts function locates up to 15 change points in the respiratory rate data, with each resulting step covering at least 15 minutes, and the loop above flattens each segment to its mean. We chose these parameters semi-arbitrarily, with the intent of morphing the result into a textbook representation of a sleep-cycle graph.

Results

The audio data for the plot below, represented in light-gray, is a 1 minute excerpt of Andrew’s sleep. It demonstrates our program’s peak analysis algorithm.
The envelope, represented by the black lines, wraps around most of the source’s peaks. The peaks that do not fall under the envelope were filtered out by the Hampel filter, which suggests they were outliers, most likely unwanted impulse noise. All of the envelope’s peaks are marked with an asterisk; counting the asterisks yields the respiratory rate. For this plot, the respiratory rate also represents one data point in Andrew’s 7-hour sleep during the hackathon, shown in the plot below.
Respiratory rate is represented by the light-gray signal, which is visibly very noisy. Because the variations are typically ±1 breath, they are likely caused simply by sampling error. The black line represents a smoothed version of the respiratory rate signal. Since respiratory rate generally changes with sleep cycle stage, plots like this could be used in practice to track sleep cycles.
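If you want to recreate plots along these lines from the snippets above, a minimal sketch (ours, not the team’s exact plotting code; the minute index is arbitrary) could look like this:
% Plot one minute of the enveloped signal with its detected peaks.
minuteIdx = 30; % arbitrary example minute
[pks, locs] = findpeaks(source(:, minuteIdx));
figure;
plot(source(:, minuteIdx), "Color", [0.8 0.8 0.8]); hold on;
plot(locs, pks, "k*"); % each asterisk marks one detected breath
xlabel("Sample (60 Hz)"); ylabel("Amplitude");
title("Enveloped breathing audio with detected peaks");
% Plot the raw respiratory rate against its step-averaged version.
figure;
plot(respiratoryRate, "Color", [0.8 0.8 0.8]); hold on;
plot(values, "k", "LineWidth", 1.5); % smoothed steps from findchangepts
xlabel("Minute of sleep"); ylabel("Breaths per minute");
title("Respiratory rate with step-averaged sleep stages");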
All of us tested the peak analysis algorithm extensively and found it generally tolerant to background noise and low breathing volume, while remaining highly accurate at identifying the number of breaths taken. Accuracy degrades when the algorithm samples audio longer than a minute, which is another reason discretizing respiratory rate on a minute-by-minute basis was a good design choice. Our final result is a sleep stage graph, the data for which is exported to our web app, where it is rendered by its frontend.

Key Takeaways

The main thing we learned from our project was how to combine various sub-projects into one main project to get as much functionality as possible. We each played to our strengths, whether backend, frontend, or MATLAB, and then combined those parts into the final project. Overall, it was a great experience: we picked up new skills such as NextJS while also learning how to integrate files written in different languages into one project.

Kelp Wanted Challenge Starter Code https://blogs.mathworks.com/student-lounge/2023/11/29/kelp-wanted-challenge-starter-code/ Wed, 29 Nov 2023 13:57:11 +0000


Getting Started with MATLAB

We at MathWorks, in collaboration with DrivenData, are excited to bring you this challenge! The goal is to develop an algorithm that can use provided satellite imagery to predict where kelp is present and where it is not. Kelp is a type of seaweed or algae that often grows in clusters known as kelp forests, which provide shelter and stability for many coastal ecosystems. The presence and growth of kelp is an important measurement for evaluating the health of these ecosystems, so the ability to easily monitor kelp forests could be a huge step forward in coastal climate science. In this blog, we will explore the data using the Hyperspectral Viewer app, preprocess the dataset, then create, evaluate, and use a basic semantic segmentation model to solve this challenge.
To request your complimentary MATLAB license and access additional learning resources, check out this website!

Table of Contents:

  1. Explore and Understand the Data
  2. Import the Data
  3. Preprocess the Data
  4. Design and Train a Neural Network
  5. Evaluate the Model
  6. Create Submissions

Explore and Understand the Data

Instructions for accessing and downloading the competition data can be found here.

The Input: Satellite Images

The input data is a set of augmented satellite images that have seven layers or “bands”, so you can think of it as 7 separate images all stacked on top of each other, as shown below.
Each band is looking at the same exact patch of earth, but they each contain different measurements. The first 5 bands contain measurements taken at different wavelengths of the light spectrum, and the last two are supplementary metrics to better understand the environment. The following list shows what each of the seven bands measures:
  1. Short-wave infrared (SWIR)
  2. Near infrared (NIR)
  3. Red
  4. Green
  5. Blue
  6. Cloud Mask (binary – is there cloud or not)
  7. Digital Elevation Model (meters above sea-level)
Typically, most standard images just measure the red, green, and blue values, but by including additional measurements, hyperspectral images can enable us to identify objects and patterns that may not be easily seen with the naked eye, such as underwater kelp.
kelp-example.png
[Left: A true color image of an example tile using the RGB bands. Center: A false color image using the SWIR, NIR, and Red bands. Right: The false color image with the labeled kelp mask overlayed in cyan.]
Let’s read in a sample image and label for tile ID AA498489, which we will explore to gain a better understanding of the data.
firstImage = imread('train_features/AA498489_satellite.tif');
firstLabel = imread('train_labels/AA498489_kelp.tif');

The Spectral Bands (1-5)

Let’s start by exploring the first five layers. The rescale function adjusts the values of the bands so that they can be visualized as grayscale images, and the montage function displays each band next to each other.
montage(rescale(firstImage(:, :, 1:5)));
Here we can see that there are some land masses present, and that the SWIR and NIR bands have higher values than the red, green, and blue bands when looking at this patch of earth, as they are brighter. This doesn’t tell us much about the data, but it gives us an idea of what we are looking at.

Hyperspectral Viewer

You can use the Hyperspectral Viewer app to further explore the first five layers. Note that this requires the Image Processing Toolbox™ Hyperspectral Imaging Library, which can be installed through the Add-On Explorer, and is not supported in MATLAB Online. The center wavelengths shown below are approximated from this resource.
firstImSatellite = firstImage(:, :, 1:5);
centerWavelengths = [1650, 860, 650, 550, 470]; % in nanometers
hcube = hypercube(firstImSatellite, centerWavelengths);
hyperspectralViewer(hcube);
When the app opens, you’ll have the ability to view single bands on the left pane and various band combinations on the right. Note that the bands are shown in order of wavelength, not in the order they are loaded, so in the app the bands are in reverse order. Band 1 = Blue, Band 5 = SWIR.
On the left pane, you can scroll through and view each band one at a time. You can also manually adjust the contrast to make it easier to see or to make it representative of a different spectrum than the default.
ExploreBands.gif
On the right, you’ll have the ability to see False Color, RGB, and CIR images. RGB images are just standard color images, and show the earth as we would see it from a typical camera. False Color and CIR images convert the measurements from the SWIR and NIR bands, which are not visible from the human eye, to colors that we can see. You can manually adjust the bands to create custom images as well.
In this pane, you also have the ability to create spectral plots for a single pixel, which shows what value that pixel holds for each band. Since this image has land, sea, and coast, I’ll create spectral plots for a pixel in each of these areas to see how they differ.
BandCombos.gif
This app also provides the ability to plot and interact with various spectral indices that calculate different measurements related to vegetation, which could provide helpful additional information when looking for kelp. Learn more about these spectral indices by checking out this documentation link.
Indices.gif
If you have some plots that you’d like to work with further, you can export any of these to the MATLAB workspace. I’ll use the RGB image in a moment, so let’s export it.
Export.gif

The Physical Property Bands

The other two layers of the input images are not based on the light spectrum, but on physical properties. The cloud mask can be visualized as a black-and-white image, where black means there was no cloud present and white means there was cloud blocking that part of the image.
cloudMask = firstImage(:, :, 6);
imshow(double(cloudMask));
This image is almost all black, so there was very little cloud blocking the satellite, but there are a few white pixels as highlighted in the image below.
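Rather than hunting for the white pixels by eye, you could count and locate them directly; this small snippet is our addition for illustration, not part of the original walkthrough:
% Count cloud-flagged pixels and find their row/column locations.
numCloudPixels = nnz(cloudMask);
[cloudRows, cloudCols] = find(cloudMask);
fprintf("%d of %d pixels are flagged as cloud.\n", numCloudPixels, numel(cloudMask));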
The elevation mask can be visualized using the imagesc function, which will colorize different parts of the image based on how high above sea level each pixel is. As one might expect, the highest elevation in our image correlates to the large land mass.
elevationModel = firstImage(:, :, 7);
imagesc(elevationModel);
colormap(turbo);
colorbar;

The Output: A Binary Mask

The corresponding label for this satellite image is a binary mask, similar to the cloud mask. It is 350×350, the same height and width as the satellite images, and each pixel is labeled as either 1 (kelp detected) or 0 (no kelp detected).
imshow(double(firstLabel))
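As a quick sanity check (again an addition of ours, not in the original post), you can measure how much of the tile is labeled as kelp:
% Fraction of pixels labeled as kelp in this tile (labels are 0/1).
kelpFraction = mean(double(firstLabel(:)));
fprintf("Kelp covers %.2f%% of tile AA498489.\n", 100 * kelpFraction);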
You can add these labels over the RGB satellite image we exported earlier to see where the kelp is in relation to the land masses.
labeledIm = labeloverlay(rgb, firstLabel);
imshow(labeledIm);

Import the Data

To start working with all of the data in MATLAB, you can use an imageDatastore and a pixelLabelDatastore. pixelLabelDatastore expects uint8 data, but the labels are currently int8, so I’ve created a custom read function (readLabelData) to convert the label data to the correct format.
trainImagesPath = './train_features';
trainLabelsPath = './train_labels';
allTrainIms = imageDatastore(trainImagesPath);
classNames = ["nokelp", "kelp"];
pixelLabelIDs = [0, 1];
allTrainLabels = pixelLabelDatastore(trainLabelsPath, classNames, pixelLabelIDs, ReadFcn=@readLabelData);
Now we can divide the data into training, validation, and testing sets. The training set will be used to train our model, the validation set will be used to check in on training and make sure the model is not overfitting, and the testing set will be used after training to see how well the model generalizes to new data.
numObservations = numel(allTrainIms.Files);
numTrain = round(0.7 * numObservations);
numVal = round(0.15 * numObservations);
trainIms = subset(allTrainIms, 1:numTrain);
trainLabels = subset(allTrainLabels, 1:numTrain);
valIms = subset(allTrainIms, (numTrain + 1):(numTrain + numVal));
valLabels = subset(allTrainLabels, (numTrain + 1):(numTrain + numVal));
testIms = subset(allTrainIms, (numTrain + numVal + 1):numObservations);
testLabels = subset(allTrainLabels, (numTrain + numVal + 1):numObservations);
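Note that this split takes files in their stored order; if tiles are grouped in some way (for example, geographically), a shuffled split may generalize better. A minimal variant, offered as an optional tweak with a fixed seed for reproducibility:
% Shuffle indices before splitting to remove any ordering bias.
rng(0); % fixed seed so the split is reproducible
shuffledIdx = randperm(numObservations);
trainIms = subset(allTrainIms, shuffledIdx(1:numTrain));
trainLabels = subset(allTrainLabels, shuffledIdx(1:numTrain));
valIms = subset(allTrainIms, shuffledIdx((numTrain + 1):(numTrain + numVal)));
valLabels = subset(allTrainLabels, shuffledIdx((numTrain + 1):(numTrain + numVal)));
testIms = subset(allTrainIms, shuffledIdx((numTrain + numVal + 1):end));
testLabels = subset(allTrainLabels, shuffledIdx((numTrain + numVal + 1):end));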

Preprocess the Data

Clean up the sample image

Now that we have a better understanding of our data, we can preprocess it! In this section, I will show some ways you can:
  1. Resize the data
  2. Normalize the data
  3. Augment the data
While ideally each image in the dataset will be the same size, data is messy, and this isn’t always the case. I’ll use imresize to ensure the height and width of each image is correct.
inputSize = [350 350 8];
firstImage = imresize(firstImage, inputSize(1:2));
Each band has a different minimum and maximum, so a value that is low for one band could be high for another. Let’s go through each layer (except the cloud mask) and rescale it so that the minimum value is 0 and the maximum is 1. There are many ways to normalize your data, so I suggest testing out other approaches.
normalizedImage = zeros(inputSize); % preallocate for speed
continuousBands = [1 2 3 4 5 7];
for band = continuousBands
normalizedImage(:, :, band) = rescale(firstImage(:, :, band));
end
normalizedImage(:, :, 6) = firstImage(:, :, 6);
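For instance, standardizing each band to zero mean and unit variance is one alternative you could try; this sketch is illustrative and is not used later in the walkthrough:
% Alternative normalization: z-score each continuous band.
zImage = zeros(inputSize);
for band = continuousBands
    bandData = double(firstImage(:, :, band));
    zImage(:, :, band) = (bandData - mean(bandData(:))) / std(bandData(:));
end
zImage(:, :, 6) = firstImage(:, :, 6); % keep the binary cloud mask as-is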
You can also use the provided data to create more data! This is called feature extraction. Since I know that kelp is often found along coasts, I’ll use an edge detection algorithm to show the edges that exist in the image, which will often include coastlines.
normalizedImage(:, :, 8) = edge(firstImage(:, :, 4), "sobel");
Now we can view our preprocessed data!
montage(normalizedImage)
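Another candidate feature, offered here only as a suggestion to explore (it is not used in the rest of this walkthrough), is a spectral index such as NDVI, computed from the NIR (band 2) and red (band 3) measurements; if you add it, remember to grow inputSize accordingly:
% NDVI = (NIR - Red) ./ (NIR + Red); eps guards against division by zero.
nir = double(firstImage(:, :, 2));
red = double(firstImage(:, :, 3));
ndvi = (nir - red) ./ (nir + red + eps);
imagesc(ndvi); colormap(turbo); colorbar; % higher values suggest vegetation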

Apply Preprocessing to the Entire Dataset

To make sure these preprocessing steps are applied to every image in the dataset, you can use the transform function. This allows you to apply a function of your choice to each image as it is read, so I have defined a function cleanSatelliteData (shown at the end of the blog) that applies these steps to every image.
trainImsProcessed = transform(trainIms, @cleanSatelliteData);
valImsProcessed = transform(valIms, @cleanSatelliteData);
Then we combine the input and output datastores so that each satellite image can easily be associated with its expected output.
trainData = combine(trainImsProcessed, trainLabels);
valData = combine(valImsProcessed, valLabels);
If you preview the resulting datastore, the satellite images are now 350x350x8 instead of 350x350x7 since we added a band in the transformation function.
firstSample = preview(trainData)
firstSample = 1×2 cell
    {350×350×8 double}    {350×350 categorical}

Design and Train a Neural Network

Create the network layers

Once the data is ready, it’s time to create a neural network. I’m going to create a simple network for semantic segmentation using the segnetLayers function.
numClasses = 2;
lgraph = segnetLayers(inputSize, numClasses, 5);

Balance the Classes

In the sample “firstImage”, there were a lot of pixels with the 0 label, meaning no kelp was detected. Ideally, we would have equal amounts of “kelp” and “nokelp” labels so that the network would learn each equally, but most images probably don’t show 50% or more kelp. To see the exact distribution of class labels in the dataset, use countEachLabel, which counts the number of pixels by class label.
labelCounts = countEachLabel(trainLabels)
labelCounts = 2×3 table
         Name        PixelCount    ImagePixelCount
    1    'nokelp'    undefined     undefined
    2    'kelp'      undefined     undefined
‘PixelCount’ shows how many total pixels contained that class, and ‘ImagePixelCount’ shows the total number of pixels in all images that contained that class. This shows that not only are there way more “nokelp” labels than “kelp” labels, but also that there are images that don’t contain any “kelp” labels. If not handled correctly, this imbalance can be detrimental to the learning process because the learning is biased in favor of “nokelp”. To improve training, you can use class weights to balance the classes. Class weights define the relative importance of each class to the training process, and by default is set to 1 for each class. By assigning class weights that are inversely proportional to the frequency of each class (i.e., giving the “kelp” class a higher weight than “nokelp”), we reduce the chance of the network having a strong bias towards more common classes.
Use the pixel label counts from above to calculate the median frequency class weights:
imageFreq = labelCounts.PixelCount ./ labelCounts.ImagePixelCount;
classWeights = median(imageFreq) ./ imageFreq
You can then pass the class weights to the network by creating a new pixelClassificationLayer and replacing the default one.
pxLayer = pixelClassificationLayer('Name', 'labels', 'Classes', labelCounts.Name, 'ClassWeights', classWeights);
lgraph = replaceLayer(lgraph, "pixelLabels", pxLayer);

Train the Network

Specify the settings you want to use for training with the trainingOptions function, and train the network!
tOps = trainingOptions("sgdm", ...
    InitialLearnRate=0.001, ...
    MiniBatchSize=32, ...
    MaxEpochs=5, ...
    ValidationData=valData);
trainedNet = trainNetwork(trainData, lgraph, tOps);
This is an example of training a neural network from the command line, but if you want to explore your neural networks visually or go through the deep learning steps interactively, check out the Deep Network Designer app documentation and starter video!

Evaluate the Model

To test the quality of your model before submission, you need to process your testing data (which we created earlier) the same way you processed your training data.
testIms = transform(testIms, @cleanSatelliteData);
We need to create a folder to contain the predictions:
if ~exist('evaluationTest', 'dir')
    mkdir evaluationTest;
end
Then we make predictions on the test data!
allPreds = semanticseg(testIms, trainedNet, ...
    MiniBatchSize=32, ...
    WriteLocation="evaluationTest");
Running semantic segmentation network
-------------------------------------
* Processed 846 images.
Once we have a set of predictions, we can use the evaluateSemanticSegmentation function to compare the predictions with the actual labels and get a sense of how well the model will perform on new data.
metrics = evaluateSemanticSegmentation(allPreds,testLabels);
Evaluating semantic segmentation results
----------------------------------------
* Selected metrics: global accuracy, class accuracy, IoU, weighted IoU, BF score.
* Processed 846 images.
* Finalizing... Done.
* Data set metrics:

    GlobalAccuracy    MeanAccuracy    MeanIoU    WeightedIoU    MeanBFScore
    ______________    ____________    _______    ___________    ___________
       0.94677          0.52232       0.47932      0.94021        0.15665
To understand how often the network predicted each class correctly and incorrectly, we can extract the confusion matrix. In a confusion matrix:
  • The rows represent the actual class.
  • The columns represent the predicted class.
metrics.ConfusionMatrix
ans = 2×2 table
                     nokelp       kelp
    1    nokelp    undefined    undefined
    2    kelp      undefined    undefined
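Per-class recall and precision follow directly from this matrix; the sketch below is a small illustrative addition of ours, not part of the original walkthrough:
% Rows are actual classes; columns are predicted classes.
cm = metrics.ConfusionMatrix{:, :};   % extract the numeric matrix from the table
recall = diag(cm) ./ sum(cm, 2);      % per class: correct / total actual
precision = diag(cm) ./ sum(cm, 1)';  % per class: correct / total predicted
disp(table(metrics.ConfusionMatrix.Properties.RowNames, recall, precision, ...
    'VariableNames', {'Class', 'Recall', 'Precision'}))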
To learn more about these metrics, check out this documentation page and scroll down to the “Name-Value Arguments” section.

Create Submissions

When you have a model that you’re happy with, you can use it on the submission test dataset and create a submission! First, specify the folder that contains the submission data and create a new folder to hold your predictions.
testImagesPath = './test_features';
if ~exist('test_labels', 'dir')
    mkdir test_labels;
end
outputFolder = 'test_labels/';
Since the submissions need to have a specific name and filetype, we’ll use a for loop to go through all of the submission images, use the network to make a prediction, and write the prediction to a file.
% List all TIF files in the submission test folder. dir is used instead of ls
% so the code works on any platform and avoids padded character matrices.
testImsList = dir(fullfile(testImagesPath, '*.tif'));
testImsCount = numel(testImsList);
for testImIdx = 1:testImsCount
    % Import the test image.
    testImFilename = testImsList(testImIdx).name;
    testImPath = fullfile(testImagesPath, testImFilename);
    rawTestIm = imread(testImPath);
    % Extract the tile ID from the filename.
    filenameParts = split(testImFilename, "_");
    tileID = filenameParts{1};
    testLabelFilename = [tileID '_kelp.tif'];
    % Preprocess and predict on the test image.
    testIm = cleanSatelliteData(rawTestIm);
    numericTestPred = semanticseg(testIm, trainedNet, OutputType="uint8");
    % Convert from categorical indices (1 and 2) to the expected labels (0 and 1).
    testPred = numericTestPred - 1;
    % Create the TIF file and export the prediction.
    filename = fullfile(outputFolder, testLabelFilename);
    imwrite(testPred, filename);
end
Then, use the tar function to compress the folder to an archive for submission.
tar('test_labels.tar', 'test_labels');
Thank you for following along! This should serve as basic starter code to help you begin analyzing the data and work toward a more efficient, optimized, and accurate model using more of the available training data, and we are excited to see how you build upon it and create models that are uniquely yours. Note that this model was trained on a subset of the data, so the numbers and individual file and folder names may differ from what you see when you use the full dataset.
Feel free to reach out to us in the DrivenData forum if you have any further questions. Good luck!

Helper Functions

function labelData = readLabelData(filename)
    rawData = imread(filename);
    rawData = imresize(rawData, [350 350]);
    labelData = uint8(rawData);
end

function outIm = cleanSatelliteData(satIm)
    inputSize = [350 350 8];
    satIm = imresize(satIm, inputSize(1:2));
    outIm = zeros(inputSize); % preallocate for speed
    continuousBands = [1 2 3 4 5 7];
    for band = continuousBands
        outIm(:, :, band) = rescale(satIm(:, :, band));
    end
    outIm(:, :, 6) = satIm(:, :, 6);
    outIm(:, :, 8) = edge(satIm(:, :, 4), "sobel");
end

An Autonomous Quadruped Manipulator for Pick and Place Applications https://blogs.mathworks.com/student-lounge/2023/11/20/an-autonomous-quadruped-manipulator-for-pick-and-place-applications/ Mon, 20 Nov 2023 14:38:59 +0000


For today’s post Roberto Valenti, who leads the MathWorks Challenge Project program will talk about a senior design class project at University of Sheffield. Over to you, Roberto…

At the University of Sheffield’s campus, a team of eight motivated students enthusiastically embraced a real-world project as part of their senior design class. Motivated by their shared interest in robotics and a desire for gaining practical experience, they tackled an ambitious challenge project that would not only push the boundaries of their technical skills but also prepare them for promising careers in the field of autonomous robotics. The project is from the MATLAB and Simulink Challenge Project Hub, a platform that aims to bring together academia and industry to encourage innovation and bridge the gap between academic education and real-world applications.
team0.jpg
Figure 1: The Team. From left to right: Serena Cicin-Sain, Olivia Organ, Will Foster, Joseph Moore, and Sherif Sawwaf. Not present in the picture: Harry Armstrong, Oluwaseun Adewola, and Josh Orme-Herbert.

“This project has proven to be immensely valuable as it allowed us to comprehensively navigate an entire project lifecycle, engaging in all facets, including research, design, integration, and project management, all while adhering to a Systems Engineering approach. The endeavor presented significant challenges, necessitating the group to proactively demonstrate initiative through extensive research and innovative problem-solving. A team member capitalized on this experience to demonstrate their skills to potential employers, securing a job in an engineering role. The project, marked by its challenges, has also been enjoyable and rewarding, particularly given the positive feedback we have received. Our group takes considerable pride in our collective accomplishments, recognizing that it was our unwavering determination and seamless teamwork that made it all possible. We extend our sincere gratitude to our supervisor, Payam Soulatiantork, and Roberto Valenti from MathWorks, both of whose invaluable knowledge and guidance were indispensable to the project’s success”.

Take a look at the students’ submission GitHub repository.

Introduction

The selected challenge project required the team to design, model, and simulate an autonomous quadruped robot equipped with a robotic manipulator to perform sophisticated pick and place tasks. To effectively accomplish this challenge the team organized itself into four sub-teams, each working on one of the four subsystems: quadruped, manipulator, navigation, and perception. Work done on each subsystem by a specific team was integrated into a single autonomous robotic system in which a high-level state machine commands each task of the robot while keeping track of the system state. During the project development phase, efforts were coordinated and supervised to ensure timely completion while prioritizing tasks and member responsibilities to meet the predefined requirements. Risk assessments were also factored in to mitigate potential challenges.

Quadruped modeling and simulation

The quadruped sub-team designed a robot replicating animal gaits, focusing on a diagonal advanced placement akin to dogs and horses. Their design featured four legs, each with three links and joints, ensuring coordinated movement within a one-cubic-meter volume. This design allowed efficient locomotion and balance, including a stable stationary gait. The team’s inspiration from animal locomotion principles enabled effective movement of the robot and led to a successful implementation. The team began with a Simscape™ Multibody™ model of a quadruped limited to forward motion in the horizontal plane. Their goals included adding a shoulder joint for omnidirectional movement, designing control systems for joint actuation, and creating a walking plan with tailored gait phases for complex turning maneuvers.
The team explored different motion control strategies, including the Inverse Kinematic strategy and the Raibert strategy. The Raibert strategy uses a dynamic quadruped motion control model that splits the robot’s control into velocity, body rotation, and hopping height components, resulting in a dynamic trotting gait model. This approach was adopted due to its simplicity and functionality and implemented using a high-level Stateflow® controller that integrates various walking and turning states of the quadruped into the final system.
Once the quadruped successfully achieved turning movements while maintaining stability, the team implemented a high-level Stateflow task planner for enhanced navigation. This task planner determines the quadruped’s movement mode (stationary, walking straight, turning left, or turning right) based on the relative angle and distance to the target coordinates, enabling effective navigation and obstacle avoidance. A demonstration of the quadruped locomotion including a visual of the walking controller and the high-level task planner is shown in Figure 2.
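As a rough illustration of that decision logic, here is a hypothetical MATLAB rendering; the function name and threshold arguments are invented for this sketch, and the team’s actual implementation is a Stateflow chart:
function mode = selectMotionMode(relAngle, dist, angleTol, distTol)
% Choose a movement mode from the relative heading angle and distance
% to the target, mirroring the high-level task planner described above.
if dist < distTol
    mode = "stationary";    % close enough: stop and hold position
elseif abs(relAngle) <= angleTol
    mode = "walkStraight";  % roughly facing the target
elseif relAngle > 0
    mode = "turnLeft";
else
    mode = "turnRight";
end
end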
The successful implementation of a stable walking and turning control algorithm was a critical achievement for the quadruped project and served as a foundation for integrating the other subsystems.
quadrupedLocomotion.gif
Figure 2: Quadruped walking gait demonstration. (Top) High-Level Stateflow diagram for motion mode transition and execution, (Bottom Left) Stateflow walk cycle for the Raibert strategy mode, (Bottom Right) Quadruped 3D visualization.

Manipulation

The manipulator sub-team adopted a Kinova Gen3 robotic manipulator for this project. The manipulator model was scaled using CAD software to meet the project requirements and the URDF files were then loaded into MATLAB® for creating a Simscape model with bodies and revolute joints. A closed-loop control system with PI control was implemented for each joint to ensure smooth and precise movement.
manipulatorModel.png
Figure 3: The end effector. (Left) Simscape model, (Right) full manipulator with attached the end effector.
To enable object interaction, an end effector with three fingers was attached to the manipulator. The end effector has two revolute joints per finger, as shown in Figure 3, allowing it to mimic human fingers and securely grasp objects. Inverse and forward kinematics calculations were performed using a rigid body tree: a forward kinematics model determines the position and rotation of the end effector from the joint angles, and an inverse kinematics solver utilizing the Broyden-Fletcher-Goldfarb-Shanno (BFGS) Gradient Projection algorithm finds the joint angles required to reach a given transform.
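For readers who want to try this themselves, MATLAB’s inverseKinematics object (Robotics System Toolbox) exposes this BFGS Gradient Projection solver. The sketch below is illustrative only: the robot model, body name, target pose, and weights are placeholders, not the team’s actual setup.
% Illustrative IK setup; 'robot' (a rigidBodyTree) and the body name
% 'end_effector' are placeholders for your own model.
ik = inverseKinematics('RigidBodyTree', robot, ...
    'SolverAlgorithm', 'BFGSGradientProjection');
weights = [0.25 0.25 0.25 1 1 1];         % orientation vs. position weighting
initialGuess = homeConfiguration(robot);
targetPose = trvec2tform([0.4 0 0.3]);    % example 4x4 homogeneous transform
[configSol, solInfo] = ik('end_effector', targetPose, weights, initialGuess);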
A high-level controller facilitates object picking within the defined environment. The manipulator subsystem (Figure 4) includes a state variable for communication between the quadruped and manipulator. This variable determines when the quadruped is positioned next to the object or the desired end position.
The manipulator task planner manages the order of operations and ensures that the manipulator reaches the correct position before proceeding. It coordinates the movements of the manipulator joints and monitors the states of all subsystems.
fullManipulatorModel.png
Figure 4: The full manipulation subsystem model.
With the manipulator subsystem implemented, the manipulator is able to pick up objects within the environment. The end effector, equipped with three fingers, curls around an object to secure it in place. Contact forces between the floor, object, and manipulator were simulated for realistic interactions. A complete pick and place task of the manipulator subsystem is shown in Figure 5. The manipulator subsystem was integrated with the quadruped, allowing communication between them.
Manipulator.gif
Figure 5: A ball picking demonstration.

Navigation

The navigation sub-team adopted the RRT* algorithm for path planning, enabling the quadruped to navigate through user-defined environments. The RRT* algorithm optimized paths for efficiency, and the two planned paths, start to pick-up and pick-up to drop-off, were combined into one array, as shown in Figure 6; a minimal planner sketch follows the figure.
GUIandPath.png
Figure 6: (left): Functional GUI with start, pick up and place locations, and obstacle blocks, (middle): occupancy grid path from start to object pick-up position, (right): occupancy grid path from pick-up to end place.
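For reference, a minimal RRT* setup with MATLAB’s Navigation Toolbox looks like the sketch below; the map size, start, and goal values are invented placeholders, not the team’s actual environment:
% Build a small empty occupancy map; real use would mark obstacle cells.
map = binaryOccupancyMap(10, 10, 10);   % 10 m x 10 m at 10 cells per meter
ss = stateSpaceSE2;
ss.StateBounds = [map.XWorldLimits; map.YWorldLimits; [-pi pi]];
sv = validatorOccupancyMap(ss);
sv.Map = map;
planner = plannerRRTStar(ss, sv);
start = [1 1 0];                        % [x y theta], placeholder values
goal  = [9 9 0];
[pthObj, solnInfo] = plan(planner, start, goal);
plot(pthObj.States(:, 1), pthObj.States(:, 2), "k-"); % planned waypoints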
The environment model was constructed in Simulink® based on a Graphical User Interface (GUI) input. The GUI allows intuitive placement of start and end points, obstacles, and pick-up locations, forming a visual representation of the environment. Rigid transform blocks position the quadruped, ball, and podium accurately based on this environment model. Figure 7 demonstrates the user’s interaction with the GUI, illustrating the creation of an environment map and object placement, followed by automatic generation of the 3D scenario and a navigation path.
Controller optimizations included replacing PID controllers with 1-D lookup tables for manipulator joints, ensuring smoother manipulative actions, and implementing a feedback-based braking system to fix the ball securely to the manipulator’s palm. The high-level control logic was managed by the task planner implemented in Stateflow, which orchestrates tasks such as path following, ball pick-up, placement, and simulation termination.
GUIInteraction.gif
Figure 7: A demonstration of a user interaction with the GUI: Creating an environment map, placing objects, and triggering automatic generation of a 3D scenario and navigation path.

Integration

In the final stages of the project, the team achieved seamless integration of various subsystems to create a functioning autonomous robotic system. The Simulink model illustrating this integration is presented in Figure 8. The GUI played a pivotal role in generating intricate maps and setting waypoints. The Path Planning subsystem, coupled with a Target Selector, enables the quadruped to navigate precisely, ensuring it follows the designated path accurately.
The Path Follower Stateflow module makes real-time decisions based on inputs from the Target Selector and the Angle and Distance Finder. It orchestrates intricate movements, including turns, halts, and forward strides, ensuring the quadruped’s adherence to the predefined route.
At the core of this integration, the Task Controller Stateflow oversees the entire process. It monitors the position index, ensuring the quadruped reaches each waypoint as planned. Additionally, it manages the initiation of complex tasks like picking up and dropping off the ball, synchronizing these actions seamlessly within the overall movement plan.
The Quadruped Movement Controller plays a crucial role in executing precise joint movements. These movements are meticulously calculated to maintain balance and stability, ensuring the quadruped’s movements are both accurate and secure.
The strategic placement of the manipulator on the quadruped enables a balance between stability and functionality, allowing the manipulator to perform tasks efficiently without compromising the quadruped’s stability during locomotion.
This detailed integration process highlights the team’s technical expertise, showcasing what they learned about system integration: turning subsystem elements into a unified, operational robotic system. Figure 9 shows a simulation of the final integrated model of the full system in action.
fullModel.png
Figure 8: The full system Simulink model
FullTask.gif
Figure 9: The full system in action.

Conclusion

Working on this Challenge Problem gave the team an opportunity to learn several real-world skills. They learned how to break down a large, complex robotics problem into smaller subsystems, design and implement those subsystems, and integrate them into an overall functioning system. Throughout, modeling and simulation enabled them to design functionality that met the high-level requirements.
