A lot of people are using voice commands on their Raspberry Pi computers, so if you Google “raspberry pi voice recognition” you may find some code/tips on how to do this. (Most people run Linux on their Raspberry Pi, but these sorts of applications are by no means limited to the Pi and would be runnable on normal x86 hardware running Linux too, or could be ported to run on a Windows or Mac computer after sorting out the differences in how the microphone is accessed, etc.)
A quick Google turned up this: https://jasperproject.github.io/. I remember seeing a Python project a while back when I was curious about what sort of work goes into a thing like this, and Jasper seems to be written in Python, so it may have been this project.
The general idea is to record from the microphone to get a user command, and upload the recording to the Google voice recognition API to have it transcribed. The recording part can involve varying levels of complexity. The easiest is a “press to talk” button that the user holds while speaking to the robot, so that you know exactly when to begin recording and when to stop. The harder version (what Jasper uses, IIRC) is to be “always recording” in intervals of a few seconds, evaluate the volume level of each interval to detect when somebody starts saying something, and use that signal to start recording the user’s message. It doesn’t sound like a fun wheel to reinvent, so if you can find a good off-the-shelf solution like Jasper, that’d be the best way to go.
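If you do want to see roughly what the volume-based approach looks like, here is a minimal sketch in Python. It is not Jasper's actual code, just my own illustration: the function names, the threshold of 500, and the "stop after 3 quiet chunks" rule are all made up, and it assumes the audio arrives as chunks of 16-bit signed PCM bytes (each chunk an even number of bytes). Reading those chunks from a real microphone and uploading the result for transcription are left out.

```python
import array
import math


def rms(chunk: bytes) -> float:
    """Root-mean-square level of a chunk of 16-bit signed PCM samples."""
    samples = array.array("h")
    samples.frombytes(chunk)  # raises ValueError if len(chunk) is odd
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def capture_utterance(chunks, threshold=500.0, max_silent_chunks=3):
    """Return the audio of the first utterance found in an iterable of chunks.

    Recording starts at the first chunk whose RMS level reaches `threshold`
    and stops after `max_silent_chunks` consecutive quiet chunks (or when the
    input runs out). Trailing quiet chunks are trimmed from the result.
    Both parameters are hypothetical defaults that would need tuning.
    """
    recorded = []
    silent = 0
    for chunk in chunks:
        loud = rms(chunk) >= threshold
        if not recorded:
            # Still waiting for somebody to start talking.
            if loud:
                recorded.append(chunk)
            continue
        recorded.append(chunk)
        silent = 0 if loud else silent + 1
        if silent >= max_silent_chunks:
            break
    # Drop the quiet chunks collected at the end of the utterance.
    return b"".join(recorded[: len(recorded) - silent])
```

In a real program `chunks` would be a generator that keeps reading short buffers from the microphone, and the bytes returned by `capture_utterance` are what you would send off to the transcription service. A fixed threshold is the weak point here, since background noise varies from room to room, which is part of why an existing project is the easier route.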